
RADC-TR-88-166 
Final  Technical  Report 
July  1988 


AD-A204  352 


RECONFIGURABLE DDP SYSTEM


Carnegie-Mellon University


E. Douglas Jensen, Hideyuki Tokuda, John P. Lehoczky, Lui Sha, Raymond K. Clark,
C. Douglass Locke, J. Duane Northcutt, Samuel Shipman, Charles Hitchcock,
H. M. Brinkley Sprunt, Ragunathan Rajkumar, James Wendorf, Roli Wendorf, and
Andre M. Van Tilborg




APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED. 


ROME  AIR  DEVELOPMENT  CENTER 
Air  Force  Systems  Command 
Griffiss  Air  Force  Base,  NY  13441-5700 




This  report  has  been  reviewed  by  the  RADC  Public  Affairs  Division  (PA) 
and  is  releasable  to  the  National  Technical  Information  Service  (NTIS) .  At 
NTIS  it  will  be  releasable  to  the  general  public,  including  foreign  nations. 


RADC-TR-88-166  has  been  reviewed  and  is  approved  for  publication. 


APPROVED: 


THOMAS  F.  LAWRENCE 
Project  Engineer 


APPROVED: 


RAYMOND  P.  URTZ,  JR. 

Technical  Director 
Directorate  of  Command  &  Control 


If  your  address  has  changed  or  if  you  wish  to  be  removed  from  the  RADC  mailing 
list,  or  if  the  addressee  is  no  longer  employed  by  your  organization,  please 
notify RADC (COTD) Griffiss AFB NY 13441-5700. This will assist us in
maintaining a current mailing list.

Do  not  return  copies  of  this  report  unless  contractual  obligations  or  notice 
on  a  specific  document  requires  that  it  be  returned. 


DISCLAIMER  NOTICE 


THIS  DOCUMENT  IS  BEST  QUALITY 
PRACTICABLE.  THE  COPY  FURNISHED 
TO  DTIC  CONTAINED  A  SIGNIFICANT 
NUMBER  OF  PAGES  WHICH  DO  NOT 
REPRODUCE  LEGIBLY. 


REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB No. 0704-0188


1a. REPORT SECURITY CLASSIFICATION
UNCLASSIFIED 


2a.  SECURITY  CLASSIFICATION  AUTHORITY 

N/A


2b.  DECLASSIFICATION  /  DOWNGRADING  SCHEDULE 


4.  PERFORMING  ORGANIZATION  REPORT  NUMBER(S) 
5879 


6a.  NAME  OF  PERFORMING  ORGANIZATION 


6b.  OFFICE  SYMBOL 
(If applicable)


8b  OFFICE  SYMBOL 
(If applicable)

COTD 


1b.  RESTRICTIVE  MARKINGS 
N/A 


3.  DISTRIBUTION /AVAILABILITY  OF  REPORT 
Approved  for  public  release;  distribution 
unlimited. 


5.  MONITORING  ORGANIZATION  REPORT  NUMBER(S) 
RADC-TR-88-166 


7a.  NAME  OF  MONITORING  ORGANIZATION 
Rome  Air  Development  Center  (COTD) 


7b. ADDRESS (City, State, and ZIP Code)
Griffiss AFB NY 13441-5700


9  PROCUREMENT  INSTRUMENT  IDENTIFICATION  NUMBER 

F30602-84-C-0063 


10.  SOURCE  OF  FUNDING  NUMBERS 


Carnegie-Mellon  University 


6c. ADDRESS (City, State, and ZIP Code)

Computer  Science  Dept 

Dept of Elec & Computer Engr/Dept of Statistics
Pittsburgh  PA  15213-3890 


8a.  NAME  OF  FUNDING /SPONSORING 
ORGANIZATION 

Rome  Air  Development  Center 


8c. ADDRESS (City, State, and ZIP Code)
Griffiss  AFB  NY  13441-5700 


11. TITLE (Include Security Classification)
RECONFIGURABLE C2 DDP SYSTEM


12. PERSONAL AUTHOR(S) E. Douglas Jensen, Hideyuki Tokuda, John P. Lehoczky, Lui Sha, Raymond K.
Clark,  C.  Douglass  Locke,  J.  Duane  Northcutt,  Samuel  Shipman,  Charles  Hitchcock 


13a. TYPE OF REPORT: Final
13b. TIME COVERED: FROM Apr 84 TO Oct 85
14. DATE OF REPORT (Year, Month, Day): July 1988
15. PAGE COUNT: 534


16.  SUPPLEMENTARY  NOTATION 
N/A 


PROGRAM ELEMENT NO.: 63728F
PROJECT NO.: 2530
TASK NO.: 01

17. COSATI CODES
FIELD  GROUP  SUB-GROUP

18.  SUBJECT  TERMS  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 
Decentralized Control, Concurrency Control,

Real-Time  Scheduling,  Dynamic  Resource  Allocation 

Distributed  Operating  System  -  (See  Reverse) 


19. ABSTRACT (Continue on reverse if necessary and identify by block number)

ArchOS  is  the  operating  system  being  designed  and  constructed  as  part  of  the  Archons  research 
project.  The  goal  of  the  project  is  to  conduct  research  into  the  issues  of  decentralization 
in distributed computer systems at the operating system level and below. The primary
research issues being investigated within Archons are decentralized control, team and consensus
decision  making,  transaction  management,  probabilistic  algorithms  and  architectural  support 
in  a  real-time  environment.  The  ArchOS  computational  model  has  been  designed  with  both 
the  real-time  application  writer  and  the  ArchOS  implementers  in  mind.  As  a  result,  a  high 
level of modularity and maintainability are supported by ArchOS. The computational model
features  a  set  of  cooperating  instances  of  distributed  abstract  data  types,  called  arobjects, 
and describes the various components of arobjects. The current set of ArchOS primitives is
also described in detail to define the client's view. Finally, the rationale for our design
decisions  is  also  included. 


20  DISTRIBUTION /AVAILABILITY  OF  ABSTRACT 
□ UNCLASSIFIED/UNLIMITED  □ SAME AS RPT


22a  NAME  OF  RESPONSIBLE  INDIVIDUAL 
Thomas  F.  Lawrence 


DD Form 1473, JUN 86


□  DTIC  USERS 


Previous  editions  are  obsolete. 


21.  ABSTRACT  SECURITY  CLASSIFICATION 

UNCLASSIFIED 


22b  TELEPHONE  (Include  Area  Code)  22c.  OFFICE  SYMBOL 

(315)  330-2158 


SECURITY CLASSIFICATION OF THIS PAGE

unclassified 


UNCLASSIFIED 


12.  PERSONAL  AUTHORS  (Continued) 

H. M. Brinkley Sprunt, Ragunathan Rajkumar, James Wendorf, Roli Wendorf, and
Andre M. Van Tilborg

18.  SUBJECT  TERMS  (Continued) 

Failure  Recovery 
Best  Effort 


UNCLASSIFIED 




1. Summary

This report contains a summary of research at Carnegie-Mellon University (CMU) supported by the subject
contract  for  the  period  10  April  1984  through  2  October  1985.  This  report  is  prepared  in  compliance  with 
CDRL  Item  No.  A006,  and  in  accordance  with  DID  DI-S-3591A/T,  Part  2b.  The  research  reported  herein  is 
conducted  in  the  context  of  the  Archons  Project,  a  multiply-funded  effort  that  currently  involves  25  people 
from CMU’s departments of Computer Science, Statistics, and Electrical and Computer Engineering.

The work reported herein is divided into three major sections. Chapter 2 contains a functional description
of  the  ArchOS  decentralized  operating  system.  Chapter  3  contains  a  System/Subsystem  Description  for 
ArchOS.  Chapter  4  describes  theoretical  results  in  resource  management  that  have  been  developed  during 
the  contract  period. 




1. Executive Summary

This  document  constitutes  the  Functional  Description  of  ArchOS  (CDRL  A002)  for  RADC  Contract 
No. F30602-84-C-0063, Reconfigurable DDP System. This report describes the functional
description  of  the  Archons  operating  system  (ArchOS). 

ArchOS  is  the  operating  system  being  designed  and  constructed  as  part  of  the  Archons  research 
project  [Jensen  84].  The  goal  of  the  project  is  to  conduct  research  into  the  issues  of  decentralization 
in  distributed  computing  systems  at  the  operating  system  level  and  below.  The  primary  research 
issues  being  investigated  within  Archons  are  decentralized  control,  team  and  consensus  decision 
making,  transaction  management,  probabilistic  algorithms,  and  architectural  support  in  a  real-time 
environment. 


It  is  planned  that  ArchOS  will  be  developed  in  three  stages.  The  first,  called  the  interim  testbed 
version, is now underway and will handle a small set of Sun Workstations interconnected by an
Ethernet.  It  is  expected  that  this  operating  system  will  constitute  an  existence  proof  of  many  of  the 
basic  concepts  involved  in  our  research  as  applied  to  the  construction  of  a  decentralized  operating 
system. This system will later be followed by a more complete testbed operating system incorporating
the  lessons  learned  in  the  interim  testbed  construction.  Finally,  an  analysis  of  the  ArchOS  structure  is 
expected  to  result  in  the  construction  of  specialized  hardware  for  the  purpose  of  executing  a 
complete  ArchOS  system,  and  ArchOS  will  be  rewritten  to  run  on  it  at  that  time. 


This  document  consists  of  three  reports,  namely  ArchOS  Research  Requirements,  System 
Functionality  Specification,  and  Client  Interface  Specification.  Following  this  executive  summary, 
Chapter 2 contains a description of the Research Requirements for ArchOS. The Research
Requirements summarizes our high-level goals and objectives, as well as detailing some of the
implications that they have on the ArchOS client interface, the structure of ArchOS, and the research
activities that must be supported by ArchOS. Chapter 3 contains the System Functionality
Specification for ArchOS. The System Functionality Specification identifies the Archons project's
guiding concepts, such as our view of decentralized resource management and high-level view of
system functionalities. Chapter 4 describes the Client Interface Specification for ArchOS, which
defines the client's view of ArchOS functionalities, including the ArchOS computational model, the
operating system primitives, and a rationale of the design decisions. The ArchOS computational
model has been designed with both the real-time application writer and the ArchOS implementers in
mind; as a result, a high level of modularity and maintainability are supported by ArchOS. The
computational  model  features  a  set  of  cooperating  instances  of  distributed  abstract  data  types,  called 
arobjects,  and  describes  the  various  components  of  arobjects.  The  current  set  of  ArchOS  primitives 
is  also  described  in  detail  to  define  the  client's  view.  Finally,  the  rationale  for  our  design  decisions  is 
also  included. 




2.  Research  Requirements 

2.1  Overview 

ArchOS  will  be  an  operating  system  implemented  as  a  vehicle  for  research  into  the  design, 
construction,  operation,  and  performance  of  decentralized  operating  systems.  It  will  be  designed  to 
provide  the  complete  execution  environment  for  a  single  distributed  program  on  a  loosely-coupled 
collection  of  processors.  For  the  purpose  of  this  document,  a  distributed  program  is  defined  as  a 
group of concurrent processing entities, cooperatively attempting to achieve a common goal. It is our
aim  to  explore  the  contention  that  resource  management  decisions  made  in  a  highly  decentralized 
manner  will  produce  a  highly  robust,  extensible,  and  modular  system. 

The  development  of  ArchOS  will  take  place  in  three  steps.  Initially,  an  interim  system  will  be 
produced  using  off-the-shelf  hardware  and  existing  languages  and  support  systems.  This  interim 
system  is  the  primary  focus  of  this  set  of  research  requirements,  and  will  be  used  as  a  research 
vehicle  for  the  development  of  concepts  related  to  decentralized  operating  systems.  Later,  the 
lessons  learned  from  the  design  and  construction  of  this  first,  limited  version  of  ArchOS  will  be  used 
to  produce  a  more  complete,  decentralized  operating  system.  Finally,  the  interim  system  will  be 
discarded  and  replaced  by  a  testbed  system  which  will  incorporate  the  lessons  learned  in  the 
implementation of the interim system, and will be implemented on a decentralized computer system
designed and constructed specifically for this purpose.


2.2  Goals  and  Objectives 

Despite  the  fact  that  the  first  version  of  ArchOS  is  intended  to  be  discarded  after  appropriate 
experiments  have  been  performed,  it  is  the  major  focus  of  our  initial  decentralized  operating  system 
research and development effort. This version of ArchOS will allow us to determine the degree to
which we have met our goals and objectives and will aid in the identification of areas where more
study and/or support is needed.


The major goals and objectives for this phase of our research are:

•  To develop an understanding of the nature of a particularly interesting region of the
space of decentralized systems, especially with respect to robustness, extensibility, and
modularity, through the application of fundamental concepts relating to decentralized
control, team and consensus decision making, probabilistic algorithms, distributed
non-serializable transactions, and architectural support for operating systems.


•  To maintain high system [illegible] in the [illegible] of [illegible] faults,
providing  for  no  lasting  degradation  of  system  function  or  response  beyond  the  actual 
resource  loss  encountered.  In  particular,  the  use  of  atomic  transactions  to  maintain 
consistency  will  result  in  continuance  of  useful  computation  in  the  presence  of  lost, 
delayed,  inaccurate,  or  incomplete  information. 

•  To  provide  and  encourage  high  system  modularity.  Previous  distributed  systems  have 
been  constructed  around  the  physical  communications  limitations  of  the  system,  as 
opposed  to  being  based  on  such  modern  software  engineering  principles  as  abstract 
data  types  and  information  hiding  for  defining  modularity. 

•  To  construct  highly  extensible  facilities  which  will  limit  the  cost  of  redesign  for 
implementing  widely  divergent  operating  system  facilities  for  experimentation  with  the 
fundamental  concepts  of  interest  to  us  in  this  research. 

•  To  provide  reasonable  system  performance  to  support  applications  with  real-time 
constraints.  ArchOS  must  consider  time-critical  functions  for  which  time  constraints 
define  some  system  failure  modes  (i.e.,  issues  of  timeliness  will  not  be  ignored  as  they  are 
in  some  systems,  nor  will  they  be  dealt  with  by  simply  providing  a  large  ratio  of  available 
to  currently  used  resources,  which  is  how  virtually  all  other  real-time  systems  strive  to 
meet  response  time  requirements). 


2.3  Specific  Requirements 

The goals and objectives of the previous section are stated in very general terms. We believe that
there  are  other,  more  specific,  requirements  that  must  be  met  with  respect  to  the  structure  and 
services  provided  by  ArchOS,  the  client’s  (application’s)  interface,  and  the  research  activities  to  be 
supported by ArchOS. Each of these areas is discussed in the following sections.


2.3.1  General  Characteristics 

ArchOS  must  provide  user  processes  with  the  minimum  services  of  an  operating  system  (e.g.,  CPU 
management,  storage  management,  I/O,  communication  management,  synchronization).  The  level 
of  facilities  need  not  be  uniform  across  the  range  of  functions  provided,  but  it  should  be  tailored  to  the 
needs of the experiments to be performed. The interim ArchOS implementation specified must be
implemented with off-the-shelf local area network hardware, using existing languages and an existing
development  environment. 

All of the system resources are to be managed by ArchOS in a fashion that "best facilitates" the
needs of a single program (possibly with real-time constraints), that may consist of a (potentially
large) number of (actually or apparently) concurrent and cooperating processing entities.


The specification, design, and (to a lesser extent) implementation of the system must not be based
on any assumptions about absolute (complete and correct) knowledge of system state,
of the global ordering of events, or the presence of [illegible] distributions.


The  system  must  collect  data  concerning  resource  management  decisions  and  instances  of 
communication  in  order  to  provide  on-the-fly  or  after-the-fact  analysis  capabilities. 

2.3.2 Client (Application) Interface

ArchOS  must  have  a  client  interface  that  allows  any  of  the  cooperating  processing  entities  in  an 
application to request resources in a fashion that does not require them to be aware of the actual
number  and  location  of  the  resources,  nor  of  the  number,  location  and/or  resource  needs  of  other 
processing  entities  in  the  system. 
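
The location-transparency requirement above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class `ResourcePool`, its methods `request` and `release`, and the node and resource names are hypothetical and are not ArchOS primitives. A single in-process table also stands in for resource management that ArchOS intends to perform in a decentralized way; only the shape of the client's view is shown.

```python
import itertools
from typing import Dict, Optional, Tuple


class ResourcePool:
    """Hypothetical stand-in for system-wide resource management.

    Clients name only the kind and amount of resource they need; which
    node supplies it is recorded internally and never returned to them.
    """

    def __init__(self, free_by_node: Dict[str, Dict[str, int]]) -> None:
        # Copy the input so callers cannot alter the pool's bookkeeping.
        self._free = {node: dict(kinds) for node, kinds in free_by_node.items()}
        # handle -> (node, kind, units); hidden from clients.
        self._placement: Dict[int, Tuple[str, str, int]] = {}
        self._ids = itertools.count(1)

    def request(self, kind: str, units: int = 1) -> Optional[int]:
        """Return an opaque handle, or None if no node can satisfy the request."""
        for node, kinds in self._free.items():
            if kinds.get(kind, 0) >= units:
                kinds[kind] -= units
                handle = next(self._ids)
                self._placement[handle] = (node, kind, units)
                return handle
        return None

    def release(self, handle: int) -> None:
        """Give back the units behind a handle obtained from request()."""
        node, kind, units = self._placement.pop(handle)
        self._free[node][kind] += units


if __name__ == "__main__":
    pool = ResourcePool({"node-a": {"buffer": 1}, "node-b": {"buffer": 2}})
    handle = pool.request("buffer")
    print("granted" if handle is not None else "denied")
```

The client code in the usage lines never mentions a node name or the number of nodes, which is the property the requirement asks for.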

System  security  and  protection  are  not  major  issues  in  ArchOS,  and  will  not  be  handled  beyond  the 
attempt  to  keep  one  processing  entity  running  in  the  event  of  erroneous  processing  by  another 
processing  entity.  No  attempt  will  be  made  to  detect  or  prevent  deliberate  penetration.  This  is  to  say 
that  protection  in  ArchOS  will  be  defensive  and  not  absolute. 

There  are  no  requirements  for  the  design  and  implementation  of  the  system  to  be  concerned  with 
application level debug or fault tolerance features. Furthermore, the client interface need not be
oriented  toward  "user  convenience"  (e.g.,  reasonable  defaults,  alternative  views). 

The  client  interface  is  not  an  issue  beyond  the  requirements  described  above. 

2.3.3  ArchOS  Structure 

ArchOS  should  be  structured  as  essentially  identical  peer  modules  replicated  at  each  system  node. 
These  modules  must  allow  the  use  of  the  "most  decentralized"  algorithms  (that  are  practical)  for 
managing  several  of  the  system's  resources.  These  system  resources  should  be  chosen  to 
demonstrate  the  applicability  of  decentralized  resource  management  techniques  in  a  variety  of 
situations. 

The degree and type of decentralization should be modifiable for experimentation with different
decentralization techniques.

2.3.4  Research  Activities 

ArchOS must support research into the nature of a class of decentralized systems, particularly with
respect to robustness, extensibility, and modularity, through the application of fundamental concepts
relating to decentralized control, team and consensus decision making, probabilistic algorithms,
distributed non-serializable transactions, and architectural support for operating systems.




It  should  also  be  noted  that  the  above  research  activities  provide  the  motivation  for  the  Archons 
project  in  general,  and  the  ArchOS  operating  system  development  in  particular.  That  is,  the  Archons 
project  is  not  attempting  to  develop  a  computing  facility;  it  is  attempting  to  perform  research  into  the 
nature  of  decentralized  resource  management,  especially  in  a  real  time  environment. 


3.  System  Functionality  Specification 


3.1  Overview 

The  System  Functionality  Specification  of  ArchOS  describes  the  Archons  project  guiding  concepts 
and  summarizes  the  ArchOS  facilities.  In  this  chapter,  we  will  focus  on  the  higher  level  guiding 
concepts  which  have  acted  as  a  basis  for  ArchOS  development  and  summarize  the  system 
functionalities  and  new  concepts.  In  the  following  chapter,  namely  in  the  Client  Interface 
Specification,  we  will  provide  the  detailed  description  of  the  system  functionalities,  including  ArchOS 
computational  model  and  all  of  the  ArchOS  primitives. 


3.2  Archons  Guiding  Concepts 

In  this  section,  we  first  describe  the  objectives  of  the  Archons  project  and  ArchOS.  Then,  following 
the  detailed  explanation  of  our  view  of  decentralized  resource  management,  an  overview  of  the 
ArchOS  system  facilities  will  be  provided. 


3.2.1  Archons  and  ArchOS  Objectives 

The  objectives  of  the  Archons  project  in  general,  and  of  its  ArchOS  operating  system  portion  in 
particular,  differ  significantly  in  a  number  of  ways  from  those  of  the  other  distributed  system  and 
distributed/network operating system efforts of which we are aware.

The  foremost  of  these  dissimilarities  has  to  do  with  our  concentration  on  the  special  case  of 
"distribution"  which  we  term  decentralization  (explained  further  in  Subsection  3.2.2): 

•  to  explore  the  fundamental  nature  of  making  and  carrying  out  decisions  in  a  highly 
decentralized  fashion 

o  for resource management in general,

o  but for operating systems in particular;

•  and  thereby  to  facilitate  the  creation  of 

o  substantively  improved  computer  systems  in  general  (including  uniprocessors  and 
computer  networks), 

o  but especially a novel decentralized computer which can be physically dispersed
yet which exhibits the optimality of executive level global resource management
hitherto confined to physically concentrated (and highly centralized) uni- and
multiprocessor computers.



Our  principles  of  decentralized  decision  making  and  resource  management  have  wide  applicability, 
from  integrated  man-machine  systems,  through  application  software,  operating  systems,  and  down  to 
machine  hardware.  However,  we  are  focusing  on  the  OS  levels  (and  below)  for  three  important 
reasons: 

•  the  OS  is  a  constant  beneath  many  changing  applications  — 

this  provides  generality  and  lowers  system  costs  by  solving  resource  management 
problems once instead of leaving them to be solved repeatedly by the users;

•  the degree and cost of successful decentralization above the OS depends on success at
the  OS  levels  and  below; 

•  OS  problems  are  almost  always  the  most  general,  complex,  and  dynamic  (elsewhere  in  a 
system  the  resources  are  usually  more  dedicated)  — 

consequently,  solutions  at  the  OS  levels  are  more  likely  to  be  amenable  for  use  at  higher 
or  lower  levels,  while  the  converse  is  much  less  likely. 

Any  system  can  be  expected  to  have  resources  local  to  each  node,  but  we  are  disregarding  them  in 
our  research  since  managing  them  is  so  well  understood  by  comparison  with  managing  global 
resources. 

We  review  the  most  salient  aspects  of  our  quest  for  decentralized  global  executive  resource 
management  in  the  next  subsection.  Following  that,  we  list  some  various  other  distinguishing  facets 
of  our  venture. 

3.2.2  Decentralized  Resource  Management 

There  currently  seems  to  be  little  common  understanding  about  what  "distributed"  decision  making 
or resource management means: this is one of the reasons that the "distributed" and network
operating systems in the literature and laboratories are so very conceptually and functionally
disparate. We have chosen to use the term "decentralization," and to attempt to rather carefully
(albeit not formally) define what we mean by it.

"Centralized"  and  "decentralized"  are  not  usefully  viewed  as  a  dichotomy,  but  rather  as  the 
endpoints of a continuum -- indeed, as diagonally opposed vertices of a multidimensional space. We
are not under a misconception that extreme decentralization is necessarily advantageous in all ways
and under all circumstances. However, our experience (both prior to and subsequent to the initiation
of the Archons and ArchOS research) provides cogent arguments and concrete evidence that
movement from the heavily populated highly centralized subspaces can be [illegible]
in some applications we are familiar with, such as supervisory real-time control (e.g., combat platform
management  and  factory  automation),  more  decentralization  of  resource  management  offers 
improvements  in  certain  system  attributes  like  robustness  and  modularity.  In  order  that  we,  or  any 
designers  of  a  particular  system,  be  able  to  scientifically  position  ourselves  well  in  this  space  from 
maximally  centralized  to  maximally  decentralized,  far  too  little  knowledge  exists  today  about  its 
decentralized  boundary  conditions. 

It  became  apparent  to  us  that  these  issues  ought  to  be  dealt  with  explicitly  and  systematically  from 
the  ground  up,  not  in  the  prevalent  ad  hoc  adaptive  (albeit  safer  and  faster)  way,  if  the  many  attractive 
promises  of  a  physically  dispersed  computer  were  to  be  realized.  Thus,  we  launched  an  extensive 
search  for  the  limits  of  resource  management  decentralization,  divided  into  two  areas:  logical  and 
physical.  (Computer  scientists  sometimes  imagine  incorrectly  that  logical  things  are  innately  more 
conceptually  interesting  than  are  physical  things;  the  opposite  seems  true  to  us  in  this  case.) 

3.2.2.1 Logically Decentralized Resource Management
One  of  our  first  steps  was  to  create  a  conceptual  model  of  the  space  of  logical  decentralization  of 
decision making: the most detailed of its incarnations can be found in [Jensen 81a]. It can be applied
at  different  levels  of  abstraction,  from  one  instance  of  one  decision  about  one  resource,  through  all 
instances  of  all  decisions  about  all  resources.  Our  model  is  germane  to  the  management  of  local  or 
global  resources.  In  it,  decentralization  is  founded  on  multilateral  management;  not,  for  instance,  on 
the  more  common  theme  of  resource  or  functional  partitioning  (which  leads  to  autonomy  as 
maximally  decentralized,  which  we  reject).  Our  model  expresses  the  degree  of  decentralization  as 
being determined by several factors, crudely summarized as follows:

•  the  percentage  of  resources  involved; 

•  the percentage of decision makers which participate (depending on the level of
abstraction under consideration, functional partitioning and successive techniques such
as round robin may be placed at the centralized end of this axis);

•  the extent to which all decision makers must become involved before a decision has been
completed (note that resource partitioning and functional specialization are highly
centralized by this metric);

•  the degree of equality of decision maker authority and responsibility (this axis places a
premium on peer relationships, in contrast with the ubiquitous hierarchical ones).
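
The four factors above can be pictured as coordinates in a multidimensional space. The sketch below is only an illustration under stated assumptions: the class name, the field names, and the 0-to-1 scale (0 for the centralized end of an axis, 1 for the decentralized end) are hypothetical conventions, not taken from the model in [Jensen 81a].

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DecentralizationProfile:
    """One point in the space of logical decentralization.

    Each field is an assumed 0..1 coordinate on one axis of the model:
    0 is the centralized end, 1 is the decentralized end.
    """

    resources_involved: float             # fraction of resources involved
    decision_makers_participating: float  # fraction of decision makers taking part
    involvement_before_completion: float  # extent all must be involved to complete
    authority_equality: float             # equality of authority and responsibility

    def __post_init__(self) -> None:
        # Reject coordinates outside the assumed unit interval.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must lie in [0, 1], got {value}")

    def at_least_as_decentralized_as(self, other: "DecentralizationProfile") -> bool:
        """True if this point is no more centralized than `other` on every axis."""
        return all(
            getattr(self, f.name) >= getattr(other, f.name) for f in fields(self)
        )


if __name__ == "__main__":
    # Illustrative values only; they are not measurements of any real system.
    partitioned = DecentralizationProfile(1.0, 0.1, 0.1, 0.5)
    peers = DecentralizationProfile(1.0, 1.0, 1.0, 1.0)
    print(peers.at_least_as_decentralized_as(partitioned))
```

The sketch compares points axis by axis and deliberately offers no single combined score, since the text treats centralization and decentralization as opposite vertices of a multidimensional space rather than as one number.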


To this version we subsequently added a negotiation axis.

At the [illegible], more centralized end of the negotiation axis, each decision is made by a collection
of  entities  which  work  as  a  "team"  to  move  the  overall  system  toward  its  goals.  Any  team  member  is 
allowed  to  make  certain  (not  necessarily  fixed)  decisions  without  necessarily  gaining  the  concurrence 
of  the  other  team  members;  these  decisions  may  be  constituent  subdecisions  or  different 
instantiations  of  the  same  decision  (this  difference  is  represented  on  other  axes  of  the  model).  Any 
team  member  may  seek  information  which  will  improve  the  quality  of  its  decisions  [Marschak  72].  A 
well-known  example  of  team  decision  making  is  routing  in  the  ARPANET  communication  subnet 
[Ahuja  82]. 

At  the  opposite,  more  decentralized  endpoint,  each  member  in  a  collection  of  decision  makers 
develops  hypotheses  (deductions  and  assumptions)  with  associated  probabilities  —  these  may  be 
based  on  some  form  of  partitioned  competence  (designated  elsewhere  in  the  model)  or  disparity  of 
information  (which  may  again  be  a  logical  factor,  or  a  physical  one  as  discussed  in  the  following 
subsection).  To  make  a  decision,  members  exchange  these,  reason  about  and  modify  them,  making 
compromises  as  necessary,  and  in  this  way  enhance  the  marginal  viewpoints  to  a  more  global  view. 
This  activity  must  somehow  converge  to  a  single  consensus  decision,  perhaps  by  a  formal  method 
such  as  that  of  DeGroot  [DeGroot  74],  or  by  heuristics  such  as  inference  rules  and  algorithms  —  the 
latter  are  employed  by  some  areas  of  artificial  intelligence  such  as  problem  solving  and  expert 
systems (e.g., the Hearsay system [Erman 73], [Erman 79]). (Note, however, that Hearsay was quite
centralized by our standards: logically, because the knowledge sources were functionally specialized;
and physically, due to the shared global state "blackboard".) This technique is somewhat similar in
spirit to the "divide and conquer" approach of algorithm design but lacks the optimality of the full
Bayesian method, because the joint information has been sacrificed. That is, the various
interdependencies are only approximated by the indirect approach of reaching a consensus.
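
As an illustration of convergence to a single consensus decision, the sketch below follows the commonly presented form of the iterative pooling scheme associated with [DeGroot 74]: each member repeatedly replaces its estimate with a weighted average of all members' estimates. The function name, the weight matrix, and the initial estimates are assumptions chosen for the example; nothing here is taken from the ArchOS design.

```python
from typing import List


def degroot_consensus(
    weights: List[List[float]],
    opinions: List[float],
    tol: float = 1e-9,
    max_rounds: int = 10_000,
) -> List[float]:
    """Repeat x <- W x until all members hold (nearly) the same value.

    weights[i][j] is the weight member i gives to member j's estimate;
    each row must be non-negative and sum to 1.
    """
    n = len(opinions)
    for row in weights:
        if len(row) != n or min(row) < 0 or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each weight row must be non-negative and sum to 1")
    x = list(opinions)
    for _ in range(max_rounds):
        # Every member forms a weighted average of the current estimates.
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if max(x) - min(x) < tol:
            return x
    raise RuntimeError("estimates did not converge within max_rounds")


if __name__ == "__main__":
    # Three peers; the outer two listen only to themselves and the middle one.
    W = [
        [0.50, 0.50, 0.00],
        [0.25, 0.50, 0.25],
        [0.00, 0.50, 0.50],
    ]
    print(degroot_consensus(W, [0.2, 0.5, 0.9]))
```

Whether such an iteration converges depends on the weight matrix. The example matrix does converge, but a matrix in which two members simply swap estimates each round never settles, which is why the sketch bounds the number of rounds.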

We  are  interested  in  the  conditions  under  which  different  degrees  of  logical  decentralization 
according to our model offer how much of which attributes, and the tradeoffs involved, in managing
global resources. But the region of primary interest to us in this multidimensional space of logical
decentralization is where each global decision is made multilaterally by a group of peers through
negotiation, compromise, and consensus.

According to our view, most resource management is highly logically centralized, even in the myriad
network and distributed operating systems we are aware of.




3.2.2.2 Physically Decentralized Resource Management

We  have  long  argued  that  the  important  benefits  of  having  system-wide  resource  management  at 
the  operating  system  level,  routinely  provided  by  a  computer,  are  not  available  to  many  systems  —  the 
reason  is  that  those  systems  consist  of  multiple  nodes  which  must  be  physically  dispersed  (for 
functionality,  reliability,  and  logistical  reasons). 

Unfortunately,  operating  systems  as  presently  conceived  are  highly  and  inherently  centralized  in 
several  critical  respects.  Perhaps  most  importantly,  they  are  based  on  some  very  strong  premises 
about time, e.g., that communication delays due to physical dispersal within the operating system
are  practically  negligible  with  respect  to  the  rate  at  which  the  system  state  changes  (note  that  the 
same  effect  can  occur  on  a  single  VLSIC/VHSIC  chip).  This  leads  to  the  presumption  that  it  is 
possible  (and  even  cost-effective)  for  all  processes  to  share  as  complete  and  coherent  a  view  of  the 
entire  system  state  as  may  be  desired  (e.g.,  that  a  single  global  ordering  of  events  can  be 
established).  Another  class  of  centralized  operating  system  premises  has  to  do  with  the  types, 
frequencies,  and  effects  of  faults,  errors,  and  failures.  Both  the  time  and  fault  premises  are  rational 
given  the  historical  evolution  of  operating  systems  in  the  context  of  shared  primary  memory  (i.e., 
uniprocessors  and  multiprocessors).  Unfortunately,  many  of  these  premises  go  unstated  (e.g.,  in 
operating system texts and papers), and are either forgotten or assumed to unquestionably always
hold. 


Our focus is on achieving the global executive level resource management needed for a physically
dispersed system to be a computer in the same sense that a uniprocessor or multiprocessor is.
However, we are not restricting ourselves to virtual uniprocessors: it is frequently beneficial (e.g.,
improved fault recovery and performance) for some image of the composite and decentralized
structure of the software or hardware to (occasionally or optionally) be made or left visible to the user.


Presently, Archons appears to be essentially alone in stressing unification at the operating system
levels; the dominant theme in distributed system projects today is "autonomy." The only popular
alternative to conventional centralized computers is computer networks. A conventional generic
computer network can be characterized as follows:

• each computer is (functionally and often administratively) autonomous with its own local,
centralized operating system;

• all the computers are connected to a communications subnetwork;

• each computer has network server utility software (for transport protocol
conversions, and the like) [...] to do resource sharing (e.g., file transfer,
mail, virtual terminals);




•  there  may  be  higher  layers  of  software  for  specific  applications  (e.g.,  banking,  military 
C^),  perhaps  giving  the  users  some  unified  perception  of  the  system. 

A  network  normally  is  supplied  with  a  so-called  "network  operating  system",  which  tends  to  simply 
be the collection of network server utilities. Historically it has been constrained to being a guest of
the  local  operating  systems;  recently,  more  indigenous  (and  thus  more  effective)  network  operating 
systems  are  developing  (e.g.,  [Rashid  81]).  A  few  recent  networks  and  their  network  operating 
systems  aspire  to  eventually  make  gradual  movement  in  the  direction  of  greater  operating  system 
coordination  (e.g.,  [Spice  79]). 

Many  applications  need  nothing  more  than  long-haul  resource- sharing  or  value-added  networks,  or 
local  area  networks  of  personal  workstations.  But  not  in  the  cases  of  concern  to  us:  the  very  complex 
problems  of  achieving  system-wide  resource  management  are  forced  up  to  the  user  level  where  they 
are  more  difficult  (having  less  access  to  lower  level  resources  and  receiving  little  assistance  and 
perhaps  even  resistance  from  the  local  operating  systems);  and  where  they  must  be  solved  repeatedly 
if  there  are  multiple  users,  instead  of  once  by  the  system  designers.  The  unsurprising  consequence  is 
that these applications with substantive state change/visibility ratios must suffer: because of the
solipsistic local operating systems, system robustness is poor, modularity is compromised,
performance (e.g., concurrency) is reduced, and total system cost is increased.

We  fought  with  the  dilemmas  of  this  dichotomy  between  computers  and  computer  networks  during 
the past decade of our experience designing distributed systems and realized that having a physically
dispersed computer requires a functionally singular operating system (as opposed to a network of
independent  private  operating  systems).  The  primary  obstacles  to  be  overcome  are  that: 

•  communication  within  the  operating  system  is  inaccurate  and  incomplete  with  respect  to 
system  state  changes; 

• and the types and effects of faults, errors, and failures encountered in multinode
physically dispersed systems differ significantly in both degree and kind from those in
single node systems; this is even more pronounced in a decentralized computer than in
a computer network.

3.2.2.2.1 Accommodating Imperfect Information

High degrees of physical decentralization imply that resource management decisions routinely must
be "best effort", based on imperfect quantity and quality of information; the virtually ubiquitous
"garbage in, garbage out" characterization of computers is unrealistic and cannot be tolerated in a
physically dispersed multinode computer. This perspective is somewhat familiar above the OS levels
(e.g., in certain artificial intelligence work), and below them (e.g., in dynamic communication subnet




routing).  But  at  the  OS  levels  (as  in  most  software),  it  is  a  foreign  outlook  which  is  incompatible  with 
the  current  state  of  the  art.  Consequently,  new  problems  have  to  be  solved  in  the  design  of  the 
decision algorithms, such as: picking thresholds of result acceptability, and specifying them to the
decision makers; determining what "value" the completeness and accuracy of information utilized
contributes  to  the  "quality"  of  a  decision  result. 

In  thinking  about  these  issues,  it  becomes  clear  that  while  the  logical  and  physical  aspects  of 
decentralized  decision  making  are  conceptually  distinct,  they  strongly  interact.  For  example,  the 
decision  convergence  time  may  include  acquiring  suitably  valuable  quantity  and  quality  of 
information,  as  well  as  negotiating. 

An  unavoidable  characteristic  of  a  physically  dispersed  machine  is  a  significant  increase  in  the 
indeterminism of its behavior. The centralized mind set is that not only ends but also means at all
levels  ought  to  be  entirely  deterministic;  it  is  normally  affordable  to  closely  approximate  this  in  a 
centralized  machine,  and  instances  to  the  contrary  are  dealt  with  in  various  ad  hoc  fashions.  Our 
position  is  that  considerable  indeterminism  is  the  normal  case  in  decentralized  resource 
management,  and  can  be  exploited  to  advantage  (e.g.,  improved  robustness  and  performance)  rather 
than  merely  tolerated.  Dynamic  packet  routing  demonstrates  our  position,  and  we  have  done  so 
(transparently  to  the  users)  in  a  network  operating  system  [Sha  83a]. 

3.2.2.2.2 Faults, Errors, and Failures

In a physically dispersed multinode system (whether network or computer), reliability problems are
worse than in nondispersed and uninode systems, particularly when considered in light of the
imperfect information issues discussed above. For example, concurrency control and failure
atomicity become vastly more complicated, far beyond the realm of current centralized operating
system conceptions. Computer networks have more relevant technology in this respect, but most of it
is actually inspirational rather than directly transferable to a decentralized OS and computer.

We address failure management and recovery within an operating system by thinking of the OS
state as a special kind of distributed database which is approximately replicated at each node. This
suggests that an atomic transaction facility for use by the OS (and perhaps by higher, e.g.,
application, levels) be incorporated in each instance of its kernel ([Jensen 81b], although we made
this pivotal design decision in 1970 as a result of several enlightening discussions with Gerard Le
Lann about his distributed database concurrency control research). As a consequence, three classes
of significant research issues arise.




One  is  that  both  the  services  and  structure  of  the  OS  ought  to  be  substantively  affected  by  the 
availability  of  atomic  transactions  as  kernel  primitives.  This  is  essentially  virgin  territory;  the  few 
approximations to atomic transactions at the OS level have been ad hoc; in fact, they have not been
explicitly  viewed,  designed,  and  exploited  as  transactions. 

Secondly,  the  overhead  (especially  communication)  of  atomic  transactions,  which  is  always  a 
concern,  becomes  of  paramount  importance  at  the  OS  kernel  level.  We  have  determined  that  one  can 
achieve  great  acceleration  without  degrading  flexibility  with  thoughtfully  crafted  hardware 
mechanisms  at  the  disposal  of  software  modules  which  establish  the  desired  policies. 


The  third  class  of  research  issues  has  to  do  with  the  need  for  insightful  reconsideration  of  atomicity 
itself.  While  the  inclusion  of  atomic  transactions  in  our  OS  was  inspired  by  their  contributions  to 
conventional  database  systems,  our  transaction  facility  differs  radically  in  several  respects  from  those 
used  in  that  context.  Examples  include  the  following. 

The  conventional  serializability  theory  of  concurrency  control  includes  the  assumption  that 
consistency  and  correctness  result  from  a  single  transaction  executing  alone:  this  leads  directly  to 
the  same  result  for  all  serializable  schedules.  An  advantage  of  serializability  is  that  it  is  completely 
general and works without requiring knowledge about either the database or the transactions; but a
disadvantage  is  that  it  cannot  exploit  such  knowledge  which  may  be  available  in  any  specific  case 
(such  as  in  an  OS).  The  cost  of  this  generality  is  that  serializability  can  exclude  consistent  and 
correct  schedules  which  provide  higher  concurrency  than  those  it  permits. 


Database researchers have begun to study nonserializable consistency control methods in search
of greater concurrency, but a major difference is that a transaction can no longer be regarded as if it
were executing alone. Consequently, the consistency and correctness properties of nonserializable
scheduling rules do not follow automatically; they must be proven. So that every attempt at
nonserializability is not burdened with inventing its own rules and proving their properties, a formal
theory of nonserializable concurrency control is needed. Because it is already known that
serializability theory provides the highest degree of concurrency possible when using only the
classically defined transaction syntax, some additional kind of information is required if
nonserializability is to perform better.


Researchers other than ourselves seem to be focused exclusively on exploitation of transaction
semantics, developing syntactic structures to support programmers' specification of their own
application dependent scheduling rules. All programmers involved with the same database must




understand  the  details  of  each  other’s  transactions,  and  every  programmer  is  responsible  for  the 
consistency  and  correctness  properties  of  his  own  rules.  Such  extensive  use  of  global  transaction 
semantics  (e.g.,  the  "break  point  specifications"  in  [Lynch  83]  and  "lock  compatibility  tables"  in 
[Schwarz  82])  allows  very  high  degrees  of  concurrency,  but  appears  to  limit  this  approach  to  rather 
static  and  specific  situations. 

Modularity  is  recognized  to  be  extremely  valuable  in  software  engineering  generally;  we  consider  it 
a critical attribute in transaction-based distributed computations, particularly decentralized operating
systems.  Therefore,  we  have  created  a  different  and  more  decentralized  theory  of  nonserializable 
concurrency  control  which  seems  improved  with  respect  to  modularity;  when  a  programmer 
schedules  one  of  his  transactions,  he  need  know  only  the  agreed  upon  transaction  syntax,  the  details 
of  his  own  transaction,  and  the  consistency  constraints  of  the  database  subset  affected  by  this 
transaction.  We  define  a  new  transaction  syntax,  called  compound  transactions  (of  which  nested 
transactions  are  a  special  case),  and  its  associated  generalized  setwise  serializable  scheduling  rules 
(of  which  serializability  is  a  special  case).  Our  schedules  are  complete  in  the  sense  that  for  any 
consistency  and  correctness  preserving  schedule,  there  exists  an  equally  consistent  and  correct 
setwise  serializable  schedule  which  provides  at  least  as  much  concurrency. 
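
The contrast between classical and setwise scheduling can be illustrated with a small sketch. It is not the formal theory of [Sha 84]; it assumes, as a simplification, that a schedule is acceptable when its projection onto each independently constrained data set passes the classical conflict-serializability test. The function names and the schedule encoding are hypothetical and chosen only for illustration.

```python
def conflict_serializable(schedule):
    """Classical test: the precedence graph of conflicting operations is acyclic.

    schedule is a list of (transaction, op, item) tuples with op in {"r", "w"}.
    """
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            # Two operations conflict if they come from different transactions,
            # touch the same item, and at least one of them is a write.
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    remaining = {t for t, _, _ in schedule}
    while remaining:
        # Transactions with no predecessor among those still remaining.
        free = {t for t in remaining
                if not any(b == t and a in remaining for a, b in edges)}
        if not free:
            return False  # a cycle exists among the remaining transactions
        remaining -= free
    return True


def setwise_serializable(schedule, data_sets):
    """Simplified setwise test: apply the classical test to each data set alone."""
    return all(
        conflict_serializable([step for step in schedule if step[2] in data_set])
        for data_set in data_sets
    )


# T1 and T2 update x in one order and y in the opposite order.
schedule = [("T1", "w", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "w", "y")]
print(conflict_serializable(schedule))                 # False
print(setwise_serializable(schedule, [{"x"}, {"y"}]))  # True
print(setwise_serializable(schedule, [{"x", "y"}]))    # False
```

When x and y share no consistency constraint, the interleaving is admitted even though the classical test rejects it. When all items form a single set, the two tests coincide, which mirrors the statement that serializability is a special case.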

An  important  implication  of  our  different  approach  to  transactions  is  that  failure  management  and 
recovery must be re-evaluated. The usual notion of failure atomicity, commonly thought to be an end,
is in fact a means to maintaining consistency and correctness in the event of failures when using
serializability theory, where a transaction cannot be committed until all its actions are successful and
in stable storage. Higher concurrency can be achieved by determining conditions which permit a
transaction to commit completed steps before the end of the transaction. Such conditions are a
natural consequence of generalized setwise serializable transactions, enabling us to introduce the
concept of failure safety. In [Sha 84], we formalize our theory, and its properties of consistency,
correctness, modularity, optimality, completeness, and failure safety.

Other major departures we must make from normal atomic transaction facilities include:

• Instead of being above an OS with the corresponding functionality to draw on, our
transaction facility is beneath, inside the OS kernel. This affects the facility design
substantially.

• Rather than handling simple database objects such as records and files, it must
accommodate the far more complex, abstract, and dynamic data types found in an OS.

• An object is not necessarily located at a single node; a single instance of it may be
physically dispersed across multiple nodes.




Note  that  the  work  outlined  above  has  applicability  beyond  our  motivation  to  enable  the  creation  of 
a logically and physically decentralized operating system (and computer) which is extremely reliable
and  modular. 

3.3  ArchOS  Facilities 

The  ArchOS  system  is  intended  to  be  a  vehicle  with  which  our  research  issues  can  be  investigated, 
rather  than  an  operational  applications  programming  environment.  Thus,  the  ArchOS  facilities  must 
have  sufficient  functionalities  to  allow  test  applications  to  be  constructed  with  which  to  study  ArchOS’ 
characteristics,  but  they  need  not  provide  a  particularly  complete  set  of  facilities.  The  set  of  facilities 
provided,  then,  will  be  evaluated  with  respect  to  their  support  of  Archons’  primary  research  goals, 
rather  than  with  respect  to  an  operationally  complete  functionality. 

Nevertheless,  the  set  of  functions  described  in  the  Client  Interface  Specification  (i.e.,  Chapter  4) 
should  describe  a  set  of  facilities  each  of  which  is  functionally  closed:  i.e.,  each  function  provided  is 
complete  so  that  its  use  can  be  measured  with  respect  to  Archons’  research  issues.  Thus,  for 
example,  in  the  handling  of  transactions  at  the  client  interface,  all  of  the  important  issues  in 
transaction  management  are  covered,  even  though  other  functions,  such  as  file  management,  may  be 
missing or skimmed. It is not to be assumed that missing or incomplete functionality in the interim
testbed ArchOS represents valueless functionality, but rather that the issues involved with such
functions  are  not  within  the  primary  Archons  research  interests.  Functionality  in  such  areas  is 
expected  to  be  much  more  complete  in  later  versions  of  ArchOS  incorporating  the  results  of  the 
research  performed  during  this  early  ArchOS  implementation. 

In  this  section,  we  describe  the  important  ArchOS  facilities  focusing  on  our  design  decisions  and 
their  relationships  with  Archons’  research  issues. 

3.3.1 Arobject/Process Management Facilities

The arobject and process management facility provides the basic functions to create, remove, bind,
or unbind cooperating processing entities in ArchOS. ArchOS provides completely
network-transparent arobject/process management. That is, creation and destruction of arobjects and
processes can be performed without knowing the location of the target arobject or process.

We view an arobject as a basic module for embodying a distributed abstract data type. In particular,
we decided to treat an arobject as an active system entity rather than a passive entity. The major
reason is that an arobject can be a very natural building block for design of distributed programs and




for  construction  and  organization  of  cooperating  resource  managers.  In  particular,  there  are  several 
advantages  related  to  the  reliability  of  the  system.  First,  we  can  easily  define  an  autonomous  module. 
It  is  easy  for  an  arobject  not  only  to  initialize  its  computational  state  by  itself,  but  also  to  recover  its 
computational  state  by  itself.  In  other  words,  by  having  a  single  INITIAL  process  in  each  arobject,  a 
designer  can  give  responsibility  for  recovery  to  the  INITIAL  process.  Second,  unlike  traditional 
procedure  invocation  in  abstract  data  types,  a  caller  cannot  invoke  an  operation  of  an  arobject  in  a 
master-slave  manner,  but  must  use  a  form  of  rendezvous.  The  receiver  therefore  has  the  right  to 
accept,  reject,  or  delay  the  requested  function.  Finally,  the  degree  of  parallelism  within  an  arobject 
can  be  dynamically  changed  in  many  ways.  For  instance,  a  process  can  be  created  in  a  different 
node  to  perform  a  requested  computation  on  demand  or  many  processes  can  be  pre-created  to 
accept  a  particular  type  of  request. 
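
As a rough illustration of the active-entity view, the sketch below models an arobject whose receiving side decides whether to accept, reject, or delay each request, and whose state is built by an INITIAL routine. It is a single-threaded toy under our own assumptions: the class `CounterArobject`, its methods, and the reply format are hypothetical and are not ArchOS primitives.

```python
from collections import deque


class CounterArobject:
    """Hypothetical arobject guarding a counter that must never go negative."""

    def __init__(self):
        self.pending = deque()  # requests that have arrived but are not yet accepted
        self.initial()

    def initial(self):
        # Stand-in for the INITIAL process: it builds (or rebuilds) the state.
        self.value = 0

    def request(self, op, amount):
        """Caller side: hand over a request and get back a reply slot."""
        reply = {"status": "pending"}
        self.pending.append((op, amount, reply))
        return reply

    def accept_one(self):
        """Receiver side: accept, reject, or delay the oldest pending request."""
        if not self.pending:
            return
        op, amount, reply = self.pending.popleft()
        if op not in ("add", "take"):
            reply["status"] = "rejected"
        elif op == "take" and amount > self.value:
            self.pending.append((op, amount, reply))  # delay until it can be served
        else:
            self.value += amount if op == "add" else -amount
            reply["status"] = "done"


counter = CounterArobject()
take = counter.request("take", 3)   # cannot be served yet, so it will be delayed
add = counter.request("add", 5)
bad = counter.request("erase", 0)   # unknown operation, so it will be rejected
for _ in range(4):
    counter.accept_one()
print(take["status"], add["status"], bad["status"], counter.value)
```

The caller never runs the operation itself; it only deposits a request, which is the point of contrast with master-slave procedure invocation.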

3.3.2  Communication  Facilities 

ArchOS  communication  facilities  provide  support  not  only  for  a  conventional  client-(single)server 
model  but  also  for  cooperation  among  multiple  servers  in  a  distributed  environment.  Thus,  the  system 
supports  a  Request-Accept-Reply  type  of  communication  among  cooperating  arobjects  in  either  a 
synchronous  or  an  asynchronous  manner. 

To  support  such  a  cooperating  server  model  which  will  utilize  our  notion  of  distributed  best  effort 
decision  making^,  we  also  created  a  one  to  many  type  of  communication  among  cooperating 
arobjects.  That  is,  a  process  can  invoke  an  operation  on  a  group  of  particular  arobject  instances  at 
once and does not need to block itself until one or more replies come back. In this way, the
requestor  can  also  perform  its  own  activity  while  the  servers  are  working  on  the  requested  operation. 
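
The one-to-many pattern can be approximated with ordinary threads, as in the following sketch. It only mimics the shape of the interaction (issue one operation to a group, keep working, collect replies later); `LoadServer`, `request_all`, and `least_loaded` are hypothetical names, and Python's `concurrent.futures` stands in for the ArchOS communication primitives.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class LoadServer:
    """Hypothetical server instance that can report its current load."""

    def __init__(self, name, load):
        self.name, self.load = name, load

    def report_load(self, _request):
        return (self.name, self.load)


def request_all(pool, group, operation, argument):
    """Invoke one operation on every group member; return reply handles at once."""
    return [pool.submit(getattr(member, operation), argument) for member in group]


def least_loaded(group):
    with ThreadPoolExecutor(max_workers=len(group)) as pool:
        replies = request_all(pool, group, "report_load", None)
        local_work = sum(range(1000))  # the requestor keeps working meanwhile
        done, _ = wait(replies)        # block only when the replies are needed
    best = min((reply.result() for reply in done), key=lambda r: r[1])
    return best, local_work


group = [LoadServer("a", 7), LoadServer("b", 1), LoadServer("c", 4)]
print(least_loaded(group))
```

A best effort variant could stop waiting after the first few replies or after a time limit; that choice is left out here to keep the example deterministic.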

As  for  various  server  structures,  we  are  interested  in  a  collection  of  servers  which  can  increase  the 
availability and reliability of service. The productivity of the servers may also increase together in a
collection of cooperating workers. Even for the single server case, it is easy to make an arobject
autonomous and cooperative, since there is at least one process managing its service. Unlike the
master-slave relationship in remote procedure calls, the server can control not only the sequence of
incoming messages, but also the execution order of the actual services.

Another important characteristic of the communication facility is that each communication primitive
is handled as a so-called compound transaction (see Section 3.3.3) by ArchOS. Thus, the system


[Footnote on best effort decision making; the text is not legible in the source copy.]




can  guarantee  the  atomic  property  of  each  communication  activity.  In  other  words,  the  system 
performs  either  all  of  the  message  communication  activities  completely  or  nothing  at  all. 


3.3.3  Transaction  Facility 

The  ArchOS  transaction  facility  provides  two  types  of  transactions,  elementary  transactions  and 
compound  transactions,  which  may  be  nested  in  arbitrary  combinations.  Elementary  transactions 
correspond  to  the  traditional  nested  transaction  facility  [Moss  81]  if  only  elementary  transactions  are 
nested. Compound transactions [Sha 84] provide more potential concurrency than elementary
transactions and differ from elementary transactions in that, when a compound transaction commits,
the effects of the commit processing take place immediately.

Unlike traditional database transaction facilities, ArchOS itself will be built by using compound
and elementary transactions. These OS-level transactions should facilitate the maintenance of
system  consistency  while  simplifying  data  sharing  among  multiple  processes.  Another  different 
characteristic  is  that  the  ArchOS  transaction  facility  does  not  provide  failure  atomicity  and 
permanence  for  all  of  the  data  accessed  by  a  transaction.  Rather,  only  the  data  items  that  have  been 
explicitly  declared  to  be  atomic  have  these  properties. 
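
The selective nature of this guarantee can be sketched as follows. The example assumes a dictionary as the data store and models only failure atomicity (restoring declared items on abort); permanence, locking, and nesting are left out. The class `Transaction` and its methods are hypothetical, not the ArchOS interface.

```python
_MISSING = object()  # marks "item did not exist before the transaction"


class Transaction:
    """Sketch: only items declared atomic are restored when the transaction aborts."""

    def __init__(self, store, atomic_items):
        self.store = store
        self.atomic_items = set(atomic_items)
        self.undo = {}  # pre-transaction values of the atomic items written so far

    def write(self, key, value):
        if key in self.atomic_items and key not in self.undo:
            self.undo[key] = self.store.get(key, _MISSING)
        self.store[key] = value

    def abort(self):
        for key, old in self.undo.items():
            if old is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.undo.clear()

    def commit(self):
        self.undo.clear()  # nothing left to restore


store = {"queue_length": 3, "last_hint": "node-a"}
txn = Transaction(store, atomic_items={"queue_length"})
txn.write("queue_length", 4)
txn.write("last_hint", "node-b")  # not declared atomic, so not protected
txn.abort()
print(store)
```

After the abort, `queue_length` is back to 3 while `last_hint` keeps the value written inside the aborted transaction, which is the trade the text describes: the cost of atomicity is paid only for data that needs it.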


We have attempted to view ArchOS as a highly decentralized, real-time application built on top of
the kernel support facilities at various times, and that has led us to view the arobject as a
computational entity that we would use within ArchOS (insofar as possible), as well as at the
application level. While specifying the services to be provided by ArchOS and its behavior under all
conditions, it became apparent that ArchOS' internal system data objects should possess several
characteristics. In particular, the following characteristics were desired:


• Often, several different arobjects will operate on specific data items (directories, queues,
and so on). These shared data objects will reside in an arobject, so access can be
coordinated by the normal ArchOS communication facilities (particularly Accept
primitives), as well as by means of customized code in the arobject's processes. But, as
the following points illustrate, a higher level coordination will often be desirable.

• At times, it is necessary to change several different, yet related, data items as a unit (for
example, updating all of the copies of an entry in a partially replicated directory or moving
an element from one queue to another). If such atomic updates could be performed, then
it would be much easier to transform one consistent state of a set of data items to another
consistent state.

• Some data items must be permanent; that is, their state should be reliably maintained for
the life of the system.




These  attributes  could  all  be  provided  by  ArchOS  by  means  of  the  communication  facilities  in 
conjunction with custom-written code and appropriately defined locks or semaphores and critical
regions.  However,  we  desired  a  more  structured  approach.  All  of  the  above  capabilities  are 
supported  by  traditional  database  transaction  systems  [Gray  77].  In  fact,  such  transaction  systems 
can  provide  even  more  powerful  properties  (specifically,  failure  atomicity  and/or  serializability).  It 
was  felt  that  this  additional  structure  would  ease  the  programmer's  burden,  while  also  decreasing  the 
chances  for  programming  errors,  by  making  modular  programming  more  natural.  Indeed,  the  use  of 
compound  transactions  promotes  modular  construction  of  programs  by  causing  the  transaction 
author  to  think  in  terms  of  consistency  preserving  transformations  on  sets  of  atomic  data  objects, 
thereby  causing  the  program’s  data  to  be  partitioned  into  a  number  of  modular  atomic  data  sets. 
Also,  transaction  systems  in  general  can  aid  the  programmer  in  another  important  way:  the 
processing  carried  out  in  order  to  commit  or  abort  a  transaction  can  handle  a  great  deal  of  lock- 
related  bookkeeping,  even  though  the  programmer  must  explicitly  obtain  locks  on  all  of  the  atomic 
data  items.  This  frees  the  programmer  to  consider  the  correct  behavior  of  the  transactions  being 
written,  without  giving  unnecessary  consideration  to  interactions  with  other  transactions  in  the 
system.  However,  this  is  not  to  say  that  the  programmer  does  not  have  to  be  concerned  at  all  with 
locks  and  locking  protocols:  rather,  it  is  intended  to  point  out  one  aspect  of  lock  management  that  the 
programmer  does  not  have  to  handle.  Using  the  current  ArchOS  locking  protocols,  there  are  still  a 
number of decisions concerning locks that the programmer must make: for instance, whether to use
a  discrete  locking  protocol  or  a  tree  locking  protocol,  how  to  organize  the  locks  in  a  lock  tree,  what 
data  items  are  associated  with  a  given  lock,  and  so  on. 

3.3.4  File  System 

The  current  file  management  in  ArchOS  supports  a  single-level  file  structure  and  does  not  provide 
any directory structure for users. However, the system can provide at least three kinds of file
properties. First, a normal file has the data portion of the file arobject in volatile storage. Second, a
permanent file's data portion is allocated in permanent (non-volatile) storage. Finally, an atomic file
has the data portion which is declared as an atomic data object.


Although the current file system is a single level file structure, a client does not need to know the
exact location of the file itself. Furthermore, since a file is designed as an instance of the "FILE" arobject,
a client will be able to create a new type of service, such as a directory, a replicated atomic file type,
access list, etc., on top of that.




3.3.5  Real-Time  Facility 

The  real  time  facility  of  ArchOS  provides  essential  functions  to  perform  time-driven  scheduling  such 
as  obtaining  time  information  and  setting/resetting  scheduling  parameters  for  a  client  program.  This 
time-driven  scheduling  is  based  on  ArchOS’  "best  effort"  scheduling  model. 

Since  the  application  writer  best  knows  the  real-time  constraints  for  a  task,  the  ArchOS  client 
program should be able to define application-dependent deadlines by which time some specific
processing milestone must be reached. ArchOS should take this deadline information, along with any
other relevant parameters, and schedule the runnable processes in a manner that best meets the
scheduling  policy  requirement  defined  by  the  client. 

The  client  may  specify  scheduling  policies  that  define  system  behavior  in  the  case  where  the  system 
is  overloaded.  For  instance,  the  client  can  specify  that  the  scheduler  should  attempt  to  minimize  the 
number  of  missed  deadlines  or  minimize  the  average  lateness. 
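
To make the difference between such overload policies concrete, the sketch below compares two textbook single-processor orderings. It does not describe the ArchOS scheduler: it assumes all tasks are ready at time zero and run without preemption, uses the Moore-Hodgson rule to minimize the number of missed deadlines, and uses shortest-processing-time-first to minimize average lateness. The task tuples and function names are hypothetical.

```python
def evaluate(order):
    """Return (missed deadlines, average lateness) for tasks run back to back.

    Each task is (name, duration, deadline); lateness = completion - deadline.
    """
    clock = missed = total_lateness = 0
    for _name, duration, deadline in order:
        clock += duration
        total_lateness += clock - deadline
        missed += clock > deadline
    return missed, total_lateness / len(order)


def fewest_missed(tasks):
    """Moore-Hodgson rule: go in deadline order, dropping the longest task on a miss."""
    on_time, late, clock = [], [], 0
    for task in sorted(tasks, key=lambda t: t[2]):
        on_time.append(task)
        clock += task[1]
        if clock > task[2]:
            longest = max(on_time, key=lambda t: t[1])
            on_time.remove(longest)
            late.append(longest)  # run it after everything that can still be on time
            clock -= longest[1]
    return on_time + late


def least_average_lateness(tasks):
    """Shortest processing time first minimizes the mean completion time."""
    return sorted(tasks, key=lambda t: t[1])


tasks = [("a", 3, 3), ("b", 2, 9), ("c", 2, 9), ("d", 4, 5)]
print(evaluate(fewest_missed(tasks)))           # (1, 0.0)
print(evaluate(least_average_lateness(tasks)))  # (2, -0.5)
```

On this overloaded task set the first policy misses one deadline with an average lateness of 0.0, while the second misses two deadlines but lowers the average lateness to -0.5, so the choice between them is genuinely a policy decision for the client.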

By  coordinating  with  a  scheduling  policy,  a  client  may  also  create  an  application  dependent 
reconfiguration  policy.  The  scheduling  and  reconfiguration  algorithms,  under  control  of  the 
appropriate  policies,  will  be  able  to  determine  a  (sub)set  of  processes  which  should  be  removed  from 
the  local  node. 

3.3.6  Policy  Management 

The  separation  of  policy  and  mechanism  in  ArchOS  is  important  to  meeting  its  flexibility 
requirements.  Some  of  the  system  facilities,  such  as  process  scheduling,  require  the  specification  of 
policy with respect to the management of the resources involved, using mechanisms defined within
ArchOS. Thus, an application designer of ArchOS is able to create a tailored real-time environment
by applying his/her own best effort decision making.

For example, Schedule and Reconfigure will be recognized by ArchOS as policy names. All of the
client defined policies are maintained by a policy set arobject which will be instantiated by the
application program. Thus, a client can add, delete, or change an existing policy by invoking a
corresponding system primitive during run time.
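
One way to picture a policy set is as a small registry keyed by the recognized policy names, with the mechanism looking up whatever policy is currently installed. The sketch below is only that picture: `PolicySet` and its method names are hypothetical, and only the two policy names come from the text.

```python
class PolicySet:
    """Hypothetical policy set: maps recognized policy names to client functions."""

    RECOGNIZED = {"Schedule", "Reconfigure"}

    def __init__(self):
        self._policies = {}

    def set_policy(self, name, policy):
        """Add a policy, or change the one already installed under this name."""
        if name not in self.RECOGNIZED:
            raise ValueError(f"unrecognized policy name: {name}")
        self._policies[name] = policy

    def delete_policy(self, name):
        self._policies.pop(name, None)

    def apply(self, name, *args):
        """Mechanism side: run whichever policy is installed right now."""
        return self._policies[name](*args)


policies = PolicySet()
tasks = [("a", 9), ("b", 2)]  # (name, deadline)

policies.set_policy("Schedule", lambda ts: sorted(ts, key=lambda t: t[1]))
earliest_first = policies.apply("Schedule", tasks)

policies.set_policy("Schedule", lambda ts: sorted(ts, key=lambda t: t[0]))
by_name = policies.apply("Schedule", tasks)
print(earliest_first, by_name)
```

Because the mechanism consults the set on every use, replacing the installed function changes behavior at run time without touching the mechanism itself.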




3.3.7  System  Monitoring  and  Debugging  Facilities 
The  system  monitoring  and  debugging  facilities  provide  various  abilities  to  control  the  behavior  of 
cooperating  arobjects  and  their  processes  during  execution.  For  instance,  a  client  process  can 
freeze  or  unfreeze  a  target  arobject  or  process's  activity  and  can  watch  or  capture  incoming  or 
outgoing  messages  of  the  target.  However,  the  system  monitoring  and  debugging  facilities  are  not 
offered  to  a  normal  client  arobject,  rather  to  a  specially  privileged  arobject  in  the  system. 

The  following  functionalities  will  be  created  on  top  of  the  current  monitoring  and  debugging  facility. 

•  Single-step  function  for  tracing  process  activity 

•  Creating  an  arbitrary  break  point  in  an  arobject 

•  Specialized  debug  functions  such  as  Redo  or  Undo  operation  for  arbitrary  functions 




4.  Client  Interface  Specification 

4.1  Overview 

ArchOS  is  the  operating  system  being  designed  and  constructed  as  part  of  the  Archons  research 
project.  The  stated  goals  of  the  project  are  to  conduct  research  into  the  issues  of  decentralization  in 
distributed  computing  systems  at  the  operating  system  level  and  below.  In  this  area,  the  primary 
research  issues  being  investigated  within  Archons  are  decentralized  control,  team  decision  making, 
transaction  management,  probabilistic  algorithms,  and  architectural  support. 

It  is  planned  that  ArchOS  will  be  developed  in  three  stages.  The  first,  called  the  interim  testbed 
version, is now underway and will handle a small set of Sun workstations* interconnected by an
Ethernet. It is expected that this operating system will constitute an existence proof of many of the
basic concepts involved in our research as applied to the construction of a decentralized operating
system. This system will later be followed by a more complete testbed operating system incorporating
the lessons learned in the interim testbed construction. Finally, an analysis of the ArchOS structure is
expected to result in the construction of specialized hardware for the purpose of executing a full-blown
ArchOS system, and ArchOS will be rewritten to run on it at that time.

In  this  document,  we  describe  the  characteristics  of  the  interim  testbed  version  of  ArchOS  currently 
being  designed.  This  system  is  intended  to  be  a  vehicle  with  which  these  issues  can  be  investigated, 
rather than an operational applications programming environment. The client interface, then, must
have sufficient functional completeness to allow test applications to be constructed with which to
study ArchOS' characteristics, but need not provide a particularly complete set of facilities. The set of
facilities provided, then, will be evaluated with respect to their support of Archons’ primary research
goals,  rather  than  with  respect  to  an  operationally  complete  functionality. 

Nevertheless, the set of functions described in this document can, and should, describe a set of
facilities each of which is functionally closed; i.e., each function provided is complete so that its use
can be measured with respect to Archons’ research issues. Thus, for example, in the handling of
transactions at the client interface, all of the important issues in transaction management are
covered, even though other functions, such as file management, may be missing or skimmed. It is not
to be assumed that missing or incomplete functionality in the interim testbed ArchOS represents
valueless functionality, but rather that the issues involved with such functions are not within the




*Sun is a trademark of Sun Microsystems, Inc.




primary  Archons  research  interests.  Functionality  in  such  areas  is  expected  to  be  much  more 
complete  in  later  versions  of  ArchOS  incorporating  the  results  of  the  research  performed  during  this 
early  ArchOS  implementation. 

This  document  consists  of  four  chapters.  Following  this  introductory  chapter,  chapter  two  will 
contain  a  description  of  the  computational  model  of  ArchOS.  This  will  include  the  client’s  view  of 
application  software  structure,  with  a  description  of  the  primary  ArchOS  client  facilities,  language 
issues  and  the  general  application  software  development  environment  as  supported  by  ArchOS. 
Chapter  three  will  describe  the  operating  system  primitives,  including  a  brief  high  level  description  of 
the  underlying  operating  system  activities  likely  to  be  undertaken  in  response  to  each  of  the 
primitives.  Particular  attention  will  be  given  to  ArchOS’  handling  of  resource  management  and 
consistency,  as  well  as  recovery  issues.  Chapter  four  will  provide  a  rationale  of  the  design  decisions 
described  in  Chapters  two  and  three,  outlining  the  tradeoffs  made  in  the  selection  of  application 
structures  and  the  semantics  of  the  operating  system  primitives. 

4.2  ArchOS  Computational  Model 

In  this  model,  we  mix  two  paradigms  in  the  distributed  domain:  object-oriented  and  process-based 
programming. While this mixture is hardly novel (e.g., [Lazowska 81]), our paradigm differs from
others  in  several  ways.  The  primary  goal  is  a  high  level  of  modularity  and  maintainability  for  both  the 
real-time  application  implementer  using  ArchOS,  and  for  the  ArchOS  implementers  themselves. 

4.2.1  Principal  Components 

The  purpose  of  ArchOS  is  to  provide  an  execution  environment  for  a  real-time  decentralized  system, 
such  as  a  command  and  control  system.  The  application  software  of  such  a  system  can  be 
considered to be a single distributed program. The distributed program can be described in terms of
its primary constituent components, which can be further broken down until the familiar sequential
components (processes) are described.

4.2.1.1 Distributed Program

A distributed program consists of one or more arobjects (major program modules -- see Section
4.2.1.2) working toward a single goal. Generally, in a real-time system, the entire system can be
considered to be executing a single such distributed program. It is possible, however, that more than
one distributed program could be running in a system simultaneously (particularly during system test
activities), subject to the availability of resources, but the presence of more than one program may
make performance specifications unattainable because of the associated arbitrarily resolved




resource  conflicts.  Each  program  contains  a  single  Root  arobject  which  will,  in  turn,  spawn  other 
arobjects. 

In  fact,  because  ArchOS  will  be  a  testbed  operating  system  for  the  foreseeable  future,  it  is  expected 
to  be  used  almost  exclusively  in  "system  test  activities".  Because  of  this,  we  have  planned  the  client 
interface  to  allow  more  than  one  program  to  be  executed  simultaneously.  While  it  is  true  that 
performance  specifications  may  be  compromised  during  such  tests,  the  majority  of  such  tests  may 
not  be  greatly  impacted.  ArchOS  will  keep  track  internally  of  which  program  components  are  part  of 
each  separate  program,  using  a  fixed  set  of  policies  for  handling  resource  allocation  conflicts  among 
the  programs.  The  remainder  of  this  document,  however,  will  concern  itself  with  the  handling  of  a 
single  program. 

4.2.1.2 Arobject

An arobject is a distributed abstract data type consisting of two principal parts: a specification and
a body. An instance of an arobject can be dynamically created using the CreateArobject primitive.
More than one instance of a single arobject can be created, each having a distinct identity, forming an
arobject class. The arobject identifier returned by the create primitive identifies the particular
arobject instance to be addressed during communications, or arobject communications may be
initiated to an entire arobject class, using existential or universal quantifiers to specify destination
addressing (see description of Request primitive in Section 4.3.3.1).

The  lifetime  of  the  arobject  is  under  user  control  and  is  potentially  unlimited.  The  system  build 
procedures place the uninstantiated copies of the compiled and linked arobjects on long term storage
(e.g., disk), but the creation of an arobject causes an instance of the arobject to be created and
uniquely  named.  An  arobject  instance  will  be  removed  at  the  end  of  its  lifetime,  or  a  kill  operation  can 
remove  an  arobject  instance. 

4.2.1.2.1 Arobject Specification

The arobject specification, describing the external user’s view of the arobject, consists of a set of
data types and a set of operations which other arobjects will use to activate services offered by the
arobject. The specification, although it completely specifies the external interface to the arobject,
makes no commitment with respect to the number of processes within the arobject, their functions, or
their distribution. The operations are specified in a manner similar to functions in procedural
programming languages: the operation is specified with its name, its input arguments, and its output
arguments. All operation invocations and replies use call-by-value semantics.

It is important to note that there is no defined relationship between an arobject's operations and the




entities  (processes  or  procedures)  in  its  body.  Any  operation  could,  in  principle,  be  handled  by  any 
process  in  the  arobject  (see  description  of  the  Accept  primitive  in  Section  4.3.3.4). 

4.2.1.2.2 Arobject Body

The  arobject  body  consists  of  descriptions  of  private  data  types,  private  abstract  data  types,  private 
operations,  private  arobjects,  and  processes.  Every  arobject  body  must  contain  at  least  one  process, 
but  the  other  components  of  an  arobject  body  are  optional.  This  private  information  is  visible  only  to 
processes  within  the  arobject.  If  a  process  is  not  resident  in  the  same  node  as  an  instance  of  a  private 
abstract  data  type,  the  semantics  of  a  call  to  one  of  that  private  abstract  data  type’s  procedures  are 
those  of  a  remote  procedure  call.  The  semantics  of  the  remote  procedure  call  will  ensure  that  the 
procedure  will  be  called  at  most  once;  it  will  be  the  responsibility  of  the  calling  process  to  handle  the 
condition  of  a  procedure  never  being  executed. 
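
The at-most-once behavior described above can be illustrated with a small Python sketch. The source does not say how ArchOS achieves this guarantee; suppressing duplicate requests by request identifier is simply one common technique, and the names Server, call, and Unreachable are hypothetical.

```python
class Unreachable(Exception):
    """Raised to the caller when no reply arrives; the call may never have run."""


class Server:
    def __init__(self):
        self.counter = 0
        self._replies = {}           # request id -> cached reply

    def handle(self, request_id, amount):
        # A retransmitted request is answered from the cache, never re-executed.
        if request_id in self._replies:
            return self._replies[request_id]
        self.counter += amount
        self._replies[request_id] = self.counter
        return self.counter


def call(server, request_id, amount, network_up=True):
    if not network_up:
        raise Unreachable(request_id)
    return server.handle(request_id, amount)


server = Server()
first = call(server, "req-1", 5)
duplicate = call(server, "req-1", 5)     # retransmission of the same request
try:
    call(server, "req-2", 7, network_up=False)
    lost_handled = False
except Unreachable:
    lost_handled = True                  # the calling process decides what to do
```

The procedure body runs once for "req-1" even though the request arrives twice, and the caller is told when "req-2" may never have executed, which matches the division of responsibility described in the text.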

Thus,  there  are  a  number  of  potential  items  in  the  arobject  body,  each  with  a  set  of  semantics 
controlling  their  use.  The  procedures  defined  within  a  private  abstract  data  type  will  determine  the 
access  rules  to  the  encapsulated  private  data,  handling  mutual  exclusion  if  needed,  or  providing 
"dirty"  access  if  this  is  acceptable  to  the  client  system  functional  specifications.  The  private  data  can 
be  any  normal  data  types,  but  may  also  be  defined  as  atomic  or  permanent.  Data  defined  as  atomic 
will  be  forced  to  stable  storage  upon  commitment  of  a  transaction,  or  restored  to  its  prior,  or  some 
equivalent,  state  upon  transaction  abort.  Permanent  data  is  similar,  except  that  the  copying  of  this 
data  to  stable  storage  is  done  asynchronously  (with  respect  to  the  changes  made  to  the  data)  by  the 
operating  system  or  by  an  explicit  primitive,  with  no  guarantees  with  respect  to  consistency.  Atomic 
data  access  is  possible  only  within  transactions  (see  Section  4.2.4). 

Private  operations  and  private  data  types  are  exactly  the  same  as  their  counterparts  in  the 
specification part of an arobject, except that their existence is invisible outside the arobject; hence
these operations and types may be used only within the arobject. Such operations and types serve to
enhance the modularity characteristics of arobjects by providing for nesting of distributed abstract
data types. Similarly, private arobjects are simply arobjects which, once instantiated (see
CreateArobject primitive, Section 4.3.2.1) by an arobject, are visible only to the processes within the
outer arobject. Their existence is unknown to external arobjects, and they can be used to hide
implementation details in exactly the same way as normal abstract data types.




4.2.1.3 Processes

A  process  is  a  sequential  execution  unit  within  an  arobject;  it  is  the  dispatchable  entity  as  viewed  by 
the  operating  system.  An  arobject  contains  one  or  more  processes,  of  which  one  may  be  named 
INITIAL.  If  such  a  process  exists,  it  is  automatically  invoked  at  arobject  creation  time  (see  description 
of CreateArobject primitive in Section 4.3.2.1), providing for initialization of the private variables which
comprise the arobject's state. Processes may share a data object with other processes in an arobject
by encapsulating it as an instance of a private abstract type. Processes may be explicitly created only
by other processes in an arobject; their existence is invisible to processes outside the arobject. Once
created,  processes  are  terminated  at  their  own  request,  at  the  request  of  any  other  process  in  the 
arobject,  or  at  the  termination  of  the  arobject  instance. 

Processes are lightweight; i.e., no computational state (such as local data) is implicitly transferred to
a process via invocation other than its formal input parameters, so the invocation activity can be made
to be procedurally inexpensive, particularly if appropriate hardware support is available [Jensen 84].
There  is  no  necessity  that  all  processes  in  an  arobject  be  located  at  a  single  node.  The  model  does 
not  even  require  that  processes  sharing  variables  be  located  at  a  single  node,  but  the  actual 
implementation  might  contain  such  a  limitation. 

4.2.1.4 Sample Arobject

The structure of a distributed program and each component of an arobject is illustrated in Figure
4-1. The three small boxes in the left portion of the figure represent three cooperating arobjects.
Arobject  2  has  made  a  Request  of  arobject  1,  and  arobject  1  has  sent  a  corresponding  Reply. 
Similarly,  arobject  3  has  requested  some  service  from  arobject  2  and  received  a  reply. 

While  the  small  boxes  that  represent  arobjects  in  Figure  4-1  hint  at  the  internal  structure  of  an 
arobject (they show that an arobject has two parts and that one part contains internal processes and
private abstract data types (padts)), they do not show much detail. The large box on the right side of
the figure is an exploded view of arobject 2. It shows that the two parts of an arobject are a
specification and a body, and it shows the various components of each part. (These components
were all discussed earlier in this chapter.)

4.2.2  Communication  Facilities 

Arobjects communicate via invocation parameters and messages. Unlike a remote procedure call
(RPC) mechanism, the requestor and server must agree to communicate with each other. Their
relationship is thus a symmetrical pair of cooperating arobjects rather than that of a master/slave.
Such a pair can be used to build either client/server systems, or cooperating arobject




[Figure: exploded view of arobject 2. Specification: data type definitions; Operations 1-3. Body:
private data type definitions; Private Operations 1-2; private arobject; padt; process.]

Figure 4-1: Distributed Program using Arobjects

systems,  or  more  likely,  a  mixture  of  both.  This  view  pervades  both  the  application  and  ArchOS 
implementation  paradigms  being  constructed. 

In the example shown in Figure 4-2, the requestor sends a request message explicitly via the
Request primitive and the server (i.e., Process A) performs an Accept primitive. The Trans-Id variable
in the requestor is set to the unique request transaction identification if the request initiation is
successful, and is set to a null transaction id ("NULL-TID") if an error has been detected (e.g., invalid
arobject id). The body of the request message must match the corresponding operation parameter
template provided in the target arobject specification, using call-by-value semantics. Similarly, the
body of the reply message contains the operation’s return parameters. If a transaction is in progress
when the request is made, ArchOS will manage the applicable transaction semantics, depending on
the transaction type (see Section 4.2.4).


Requestor-Arobject

    Requestor-Process
    {
        :
        Trans-ID = Request(Target-arobject, <operation>, <msg>, <reply-msg>);
        :
    }

Target-Arobject

    Process A
    {
        :
        loop {
            status = AcceptAny(<operation>, <msg>);
            process-id = CreateProcess(<Process-B>, <msg>);
        }
        :
    }

    Process B
    {
        :
        Trans-ID = Reply(<req-trans-id>, <reply-msg>);
        :
    }


Figure 4-2: Communication Paradigm in ArchOS
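
The exchange in Figure 4-2 can be approximated by the following Python sketch, which uses threads and queues in place of arobjects and messages. All names here (request, accept_any, reply, process_a, process_b, NULL_TID) are stand-ins chosen for the sketch, and returning the reply through a separate queue rather than through a reply-message argument is an assumption of the sketch rather than the ArchOS calling convention.

```python
import itertools
import queue
import threading

NULL_TID = None                 # stands in for the "NULL-TID" value
_tids = itertools.count(1)
_mailboxes = {}                 # arobject id -> queue of (tid, operation, msg)
_replies = {}                   # tid -> queue that will hold the single reply


def request(target, operation, msg):
    """Send a request; return a transaction id, or NULL_TID for a bad target."""
    if target not in _mailboxes:
        return NULL_TID
    tid = next(_tids)
    _replies[tid] = queue.Queue(maxsize=1)
    _mailboxes[target].put((tid, operation, msg))
    return tid


def accept_any(me):
    return _mailboxes[me].get()


def reply(tid, reply_msg):
    _replies[tid].put(reply_msg)


def process_b(tid, operation, msg):
    # Worker created per request; it performs the operation and replies.
    reply(tid, msg.upper() if operation == "shout" else msg)


def process_a(me, n_requests):
    # Accept each request and hand it to a newly created Process B.
    for _ in range(n_requests):
        tid, operation, msg = accept_any(me)
        threading.Thread(target=process_b, args=(tid, operation, msg)).start()


_mailboxes["target"] = queue.Queue()
server = threading.Thread(target=process_a, args=("target", 1))
server.start()
tid = request("target", "shout", "hello")
answer = _replies[tid].get(timeout=5)
server.join()
bad = request("no-such-arobject", "shout", "hello")
```

As in the figure, the process that accepts a request need not be the one that replies: Process A only dispatches, and the reply is produced by Process B.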


4.2.3  System  Load  and  Initialization 

Although  system  generation  is  beyond  the  scope  of  this  document,  we  will  assume  that  a  system  of 
application arobjects has been built, and that a directory (possibly partitioned) of the arobjects has
also been constructed, and is available to each node. For each distributed program (normally one,
but possibly more than one during test phases), one arobject has been identified as a root arobject.

At system startup, once ArchOS has initialized itself and determined its initial state, the directory will
be searched for every client root arobject, each of which will then be automatically created, including
execution of its INITIAL process. This action will then complete the system load and initialization, with
or without operator assistance.
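
A minimal Python sketch of this startup sequence follows. The directory layout, the "root" and "initial" fields, and the function names are invented for the sketch; the source does not specify the directory format.

```python
# Hypothetical directory of built arobjects. "initial" is the arobject's
# INITIAL process, or None when the arobject does not define one.
directory = [
    {"name": "RadarProgram", "root": True,
     "initial": lambda state: state.update(tracks=0)},
    {"name": "Tracker", "root": False,
     "initial": lambda state: state.update(ready=True)},
    {"name": "Logger", "root": True, "initial": None},
]


def create_arobject(entry):
    state = {}
    if entry["initial"] is not None:
        entry["initial"](state)      # INITIAL runs automatically at creation
    return {"name": entry["name"], "state": state}


def system_startup(directory):
    # Only root arobjects are created here; a root arobject is expected
    # to spawn the remaining arobjects of its program itself.
    return [create_arobject(e) for e in directory if e["root"]]


running = system_startup(directory)
```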




4.2.4  Transactions 

ArchOS  will  use  transactions  in  order  to  apply  the  properties  that  transactions  have  traditionally 
displayed in database applications (such as failure atomicity and data permanence) within the
operating system. OS-level transactions should facilitate the maintenance of system consistency
while  simplifying  data  sharing  among  multiple  processes.  In  fact,  we  expect  to  use  transactions 
extensively  within  the  ArchOS  system  primitives. 

The  ArchOS  transaction  facility  does  not  provide  failure  atomicity  and  permanence  for  all  of  the 
data  accessed  by  a  transaction.  Rather,  only  the  data  items  that  have  been  explicitly  declared  to  be 
atomic  have  these  properties. 

4.2.4.1 Elementary and Compound Transactions

ArchOS  will  support  two  types  of  transactions,  elementary  transactions  and  compound 
transactions,  which  may  be  nested  in  arbitrary  combinations.  Elementary  transactions  correspond  to 
traditional  nested  transactions  [Moss  81].  Compound  transactions  [Sha  84]  are  supported  by  ArchOS 
in  order  to  provide  more  potential  concurrency  than  elementary  transactions  and  differ  from 
elementary  transactions  in  one  important  way:  when  a  compound  transaction  commits,  the 
processing  which  makes  the  effects  of  the  transaction  permanent  and  visible  (the  commit  processing) 
takes  place  immediately.  In  this  manner,  a  compound  transaction  is  treated  as  if  it  were  a  top-level 
transaction,  even  though  it  may  actually  be  nested  within  another  transaction. 
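
The difference in commit behavior can be illustrated with a simplified Python model. The Transaction class, the dictionary standing in for committed atomic data, and the buffering of uncommitted writes are assumptions of the sketch; locking, abort, and compensation are omitted.

```python
store = {"x": 0, "y": 0}              # committed, globally visible atomic data


class Transaction:
    def __init__(self, parent=None, compound=False):
        self.parent = parent
        self.compound = compound
        self.writes = {}              # uncommitted updates

    def read(self, key):
        t = self
        while t is not None:          # own writes first, then ancestors' writes
            if key in t.writes:
                return t.writes[key]
            t = t.parent
        return store[key]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if self.compound or self.parent is None:
            store.update(self.writes)               # visible immediately
        else:
            self.parent.writes.update(self.writes)  # visible only via parent
        self.writes = {}


outer = Transaction()

nested_elementary = Transaction(parent=outer)
nested_elementary.write("x", 1)
nested_elementary.commit()            # effects held by the outer transaction

nested_compound = Transaction(parent=outer, compound=True)
nested_compound.write("y", 2)
nested_compound.commit()              # treated as if it were top-level

visible = dict(store)                 # what other transactions can see now
outer.commit()
```

Before the outer transaction commits, other transactions can already observe the compound transaction's update to y but not the elementary transaction's update to x.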

Compound  transactions  must  be  viewed  differently  than  traditional,  serialisable  transactions.  In  the 
traditional  case,  the  transaction  writer  is  able  to  assume  that  his  transaction  will  be  executed  as  if  it 
were  the  only  transaction  in  the  system.  The  transaction  facility  will  perform  whatever  concurrency 
control is necessary to ensure that the entire transaction will be executed atomically so that no other
transaction can view the partial results of this transaction; other transactions can view only the final,
committed results. The system thereby ensures that the operations performed on the atomic data
objects correspond to some serial ordering of the transactions; this, in turn, guarantees that a set of
transactions which individually preserve the consistency of the atomic objects they access will also
collectively  preserve  that  consistency. 

Compound transactions do not allow the transaction writer to act as though that writer's transaction
is the only one active in the system at any given time. Rather, the writer of a compound transaction
must realize that, in essence, a compound transaction allows other transactions to view partial results
of computations. That is, if a compound transaction is nested within another transaction, when the
compound transaction commits, it is implicitly allowing other transactions to view the state of the




atomic  data  objects  that  it  has  manipulated.  Since  the  outer  transaction  has  not  yet  committed,  this 
state  may  be  considered  a  partial  result  of  the  computation  being  performed  by  the  outer  transaction. 
(Compound  transactions  allow  partial  results  to  be  visible,  thus  admitting  the  possibility  of  side 
effects,  while  providing  a  means  of  increasing  concurrency  in  applications  where  such  transactions 
can  be  employed.) 

Due  to  the  fact  that  a  compound  transaction  can  allow  other  transactions  to  see  partial  results  of  an 
ongoing  computation,  compound  transactions  must  be  written  carefully.  In  particular,  a  compound 
transaction  should  perform  a  consistency  preserving  transformation  on  the  set  of  atomic  data  objects 
that  it  accesses.  A  consistency  preserving  transformation  is  a  set  of  operations  that  transforms  a  set 
of  atomic  data  objects  from  one  consistent  state  into  another  consistent  state.  In  this  way,  no  other 
transaction  can  ever  see  a  set  of  atomic  data  objects  in  an  inconsistent  state.  For  applications  in 
which  it  is  acceptable  to  allow  other  transactions  to  view  certain  partial  results  that  represent 
consistent  states  for  a  set  of  atomic  data  objects,  compound  transactions  may  be  used,  and  an 
increase  in  overall  application  and  system  concurrency  can  result. 


The  behavior  described  above  has  several  important  consequences  for  compound  transactions: 

•  It  may  be  impossible  to  "undo"  (in  the  sense  of  Moss’s  nested  transactions)  the  effects  of 
a  committed  compound  transaction  which  is  nested  within  another  transaction.  That  is, 
in the event that a committed compound transaction is later aborted (due to the abortion
of  a  higher  level  transaction  which  contains  that  compound  transaction),  there  is  no  way 
that  ArchOS  can  guarantee  that  the  atomic  data  structures  manipulated  by  that 
compound  transaction  can  be  restored  to  the  same  state  that  they  possessed  prior  to  the 
initiation  of  the  transaction.  However,  ArchOS  does  provide  a  method  by  which  an 
arobject  author  can  define  an  operation,  called  a  compensation  operation,  to  be 
associated  with  each  operation  defined  on  a  given  arobject.  It  is  expected  that  the 
arobject  writer  will  write  a  compensation  operation  for  each  arobject  operation  that  could 
be  invoked  during  a  compound  transaction.  (In  some  cases,  a  compensation  operation 
will  involve  more  than  simply  restoring  an  object  to  its  original  state.  For  example,  after  a 
compound transaction has altered the value of some shared variable, it is possible for
other transactions to read, or even change, that value. Later, if compensation must be
performed for the committed compound transaction (due to the abortion of a higher-level
transaction), it may be necessary to take additional steps to assure that the visibility of the
shared variable’s value during the interval between the compound transaction’s
commitment and subsequent abortion has not caused any undesirable side effects. One
possible step, for instance, might involve broadcasting a message to all of the potential
viewers of the shared variable’s value, stating that a transaction abort has caused the
value seen most recently to be invalidated.) In the event that one or more arobject
operations were invoked during the processing of a committed compound transaction
that is subsequently aborted, ArchOS will properly compose the corresponding
compensation operations in order to imitate the effects of a traditional "undo" operation.
These compensation operations will transform the states of the atomic data
structures manipulated during the compound transaction execution into some member of




a class of states that are equivalent, although not necessarily identical, to the pre-transaction
states of those data structures.^

•  Compound  transactions  can  be  used  to  release  shared  resources  before  the  completion 
of  an  entire  nested  transaction  execution,  potentially  increasing  system  concurrency. 

Transactions  are  defined  by  the  arobject  programmer  by  means  of  ArchOS  primitives.  These 
primitives  appear  as  programming  constructs  similar  to  those  used  to  define  while  and  for  loops. 
Such  a  syntactic  structure  allows  a  transaction  to  be  placed  at  any  point  in  any  process,  while 
ensuring that both the beginning and the end of the transaction occur within a single process.

ArchOS  will  handle  the  ordering  of  compensation  operations  for  aborted  compound  transactions 
automatically.  This  will  be  done  by  constructing  a  sequence  of  compensation  operations 
corresponding  to  the  operations  performed  during  the  execution  of  a  compound  transaction.  In  the 
event  that  this  compound  transaction  is  committed  and  subsequently  aborted,  these  compensation 
operations  will  be  performed  in  the  reverse  order  of  the  original  execution  sequence.  As  explained 
above,  although  ArchOS  will  initiate  the  processing  of  these  compensation  operations,  the  operations 
themselves  must  be  defined  by  the  author  of  the  arobject  whose  operations  were  invoked  by  the 
compound  transaction. 
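
The ordering rule above can be sketched in Python as follows. The class CompoundTransaction and the example operations are hypothetical; in ArchOS the compensation operations would be supplied by the arobject author, and the system would only sequence them.

```python
class CompoundTransaction:
    def __init__(self):
        self._compensations = []      # one entry per operation performed

    def invoke(self, operation, compensation, *args):
        result = operation(*args)
        self._compensations.append((compensation, args))
        return result

    def compensate(self):
        # Run the compensation operations in reverse order of execution.
        while self._compensations:
            compensation, args = self._compensations.pop()
            compensation(*args)


inventory = {"bolts": 10}
log = []


def reserve(item, n):
    inventory[item] -= n
    log.append(f"reserve {item} {n}")


def unreserve(item, n):
    inventory[item] += n
    log.append(f"unreserve {item} {n}")


def label(item, n):
    log.append(f"label {item} {n}")


def unlabel(item, n):
    log.append(f"unlabel {item} {n}")


txn = CompoundTransaction()
txn.invoke(reserve, unreserve, "bolts", 3)
txn.invoke(label, unlabel, "bolts", 3)
# The compound transaction commits; later its enclosing transaction aborts.
txn.compensate()
```

In this small case compensation restores the original value exactly. In general, as noted earlier, it need only reach a state equivalent to the pre-transaction state.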


The  ArchOS  client  will  often  need  to  access  shared  data  during  transaction  processing.  Such 
accesses  are  coordinated  by  means  of  a  locking  mechanism.  The  client  must  explicitly  request  locks 
on  the  shared  data  objects  that  are  needed  for  the  execution  of  a  given  transaction.  The  following 
section  discusses  the  ArchOS  locking  facilities  in  detail. 


4.2.4.2 Locks

ArchOS supports two types of locks: discrete locks on independent, individual data items, and tree
locks [Silberschatz 80], which are structures of related discrete locks. In the case of discrete locks,
the client obtains (sets) a lock for the desired data item, manipulates the item, and releases the lock.


Tree locks are handled in a somewhat different manner. In this case, there is a tree structure of
locks, and there are a few rules that must be followed when accessing the locks. In particular:

•  Initially, a client may set a lock located at any point in the tree of locks.

•  Subsequently, a client may set any lock whose parent lock is currently held by that client.


[Footnote illegible in the source copy.]




•  A  client  cannot  acquire  a  tree  lock,  release  it,  and  then  reacquire  it  before  all  of  the 
client's  locks  in  that  tree  have  been  released. 

•  Locks  may  be  released  at  any  time,  as  long  as  the  above  conditions  hold. 

Tree  locks  have  one  important  property:  if  all  of  the  locks  involved  in  a  set  of  computations  are 
contained  in  a  single  lock  tree  and  if  the  above  rules  are  obeyed  and  if  all  tree  locks  are  released  in  a 
finite  amount  of  time,  then  deadlocks  involving  locks  in  the  tree  are  impossible. 
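
The tree lock rules can be expressed as a small checker, sketched here in Python. The class LockTree, its method names, and the choice to raise an error (rather than block) when a lock is unavailable are assumptions of the sketch; it also assumes that once a client has released all of its locks in the tree it may begin again at any point.

```python
class TreeLockError(Exception):
    pass


class LockTree:
    def __init__(self, parent_of):
        self.parent_of = parent_of    # lock name -> parent name (None for root)
        self.owner = {}               # lock name -> client currently holding it
        self.held = {}                # client -> locks currently held
        self.used = {}                # client -> locks set since last full release

    def set_lock(self, client, lock):
        held = self.held.setdefault(client, set())
        used = self.used.setdefault(client, set())
        if lock in self.owner:
            raise TreeLockError(f"{lock} is held by {self.owner[lock]}")
        if lock in used:
            raise TreeLockError(f"{client} may not reacquire {lock} yet")
        if used and self.parent_of[lock] not in held:
            raise TreeLockError(f"{client} does not hold the parent of {lock}")
        self.owner[lock] = client
        held.add(lock)
        used.add(lock)

    def release_lock(self, client, lock):
        self.held[client].remove(lock)
        del self.owner[lock]
        if not self.held[client]:
            self.used[client].clear()  # all released: client may start afresh


tree = LockTree({"root": None, "a": "root", "b": "root", "a1": "a"})
tree.set_lock("c1", "a")          # first lock: any point in the tree
tree.set_lock("c1", "a1")         # later lock: parent "a" is currently held
tree.release_lock("c1", "a")      # locks may be released at any time
outcomes = []
for lock in ("b", "a"):           # "b": parent not held; "a": reacquisition
    try:
        tree.set_lock("c1", lock)
        outcomes.append("ok")
    except TreeLockError:
        outcomes.append("refused")
```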

Within  transactions,  locks  are  obtained  explicitly  by  the  arobject  programmer.  When  a  nested 
elementary transaction commits, its locks are passed to its parent transaction; when a top-level
elementary  transaction  or  any  compound  transaction  commits,  all  of  the  locks  obtained  by  that 
transaction  are  released.  (ArchOS  will  handle  this  automatically.)  When  a  transaction  is  initiated 
from  within  the  scope  of  a  higher-level  transaction  (see  next  section  for  a  discussion  of  transaction 
scope),  it  does  not  automatically  inherit  the  locks  which  belong  to  its  parent  transaction.  However,  it 
may  attempt  to  obtain  locks  held  by  any  of  its  ancestors  by  means  of  the  SetLock  primitive. 
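
The lock passing rules above can be summarized in a short Python sketch. The class Txn and its methods are hypothetical, and contention between transactions (including a child requesting a lock held by an ancestor) is deliberately left out.

```python
class Txn:
    def __init__(self, parent=None, compound=False):
        self.parent = parent
        self.compound = compound
        self.locks = set()            # nothing is inherited from the parent

    def set_lock(self, name):
        self.locks.add(name)          # locks are always requested explicitly

    def commit(self):
        """Return the set of locks released to the system by this commit."""
        if self.parent is not None and not self.compound:
            self.parent.locks |= self.locks   # nested elementary: pass upward
            released = set()
        else:
            released = set(self.locks)        # top-level or compound: release
        self.locks = set()
        return released


top = Txn()
top.set_lock("A")

child = Txn(parent=top)
child.set_lock("B")
child_start_locks = set(child.locks)  # only "B"; "A" was not inherited
r_child = child.commit()              # "B" passes to the parent

comp = Txn(parent=top, compound=True)
comp.set_lock("C")
r_comp = comp.commit()                # "C" is released immediately

locks_held_by_top = set(top.locks)
r_top = top.commit()                  # top-level commit releases everything
```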

In  addition  to  the  lock  facilities  mentioned  above,  a  primitive  is  provided  to  release  locks.  However, 
the  ReleaseLock  primitive  must  be  used  carefully  within  transactions.  Transactions  possess  desirable 
properties  as  a  result  of  the  fact  that  they  restrict  the  freedom  with  which  the  locking  and  unlocking  of 
data  items  can  occur.  If  a  client  abuses  the  locking/unlocking  conventions  employed  by  the  ArchOS 
transaction  facility  by  improperly  making  ReleaseLock  invocations,  then  ArchOS  may  not  be  able  to 
complete  the  processing  of  the  client's  transaction.  Rather,  ArchOS  will  detect  the  violation  of  the 
locking  conventions  and  will  terminate  the  transaction  in  as  orderly  a  manner  as  possible.  More 
precisely,  ArchOS  will  abort  the  transaction  upon  detection  of  a  lock  protocol  violation.  This  may 
result in an inconsistent or incorrect state for some atomic objects. However, since the transaction
can only release locks that it can explicitly name, the integrity of atomic data objects which are
manipulated on its behalf can be guaranteed by appropriate use of transactions to encapsulate these
other atomic data objects. Such an encapsulation will guarantee that the atomic data objects
manipulated by those transactions will be in consistent states and can recover to other consistent
states when the lock-violating transaction is aborted. However, since the lock-violating transaction
has abused the locking conventions, it may be impossible to properly recover the atomic data objects
that it manipulates directly. In that case, ArchOS cannot ensure that these atomic data objects will be
in a consistent state. (See the explanation concerning consistency preserving transformations in
Section 4.2.4.1.)



instance,  it  can  be  used  to  manipulate  tree  locks  or  to  release  locks  that  were  obtained  during 
non-transaction processing.

4.2.4.3 Transaction Scope, Models, and Lock Passing
This  section  defines  some  additional  terminology  for  discussing  transactions  and  introduces  the 
notion  of  a  transaction  tree,  a  structure  that  allows  a  simple  visualization  of  the  relationships  among 
various  transactions.  In  addition,  the  rules  for  lock  propagation  in  the  ArchOS  transaction  facility  are 
specified  in  more  detail. 

The  preceding  discussion  concerning  transactions  dealt  with  the  notion  of  nested  transactions. 
The  transaction  tree  is  a  structure  that  can  be  used  to  explain  the  behavior  of  a  set  of  transactions. 
Figure  4-3  contains  an  example  of  a  transaction  tree. 


Figure 4-3: Sample Transaction Tree


Each circle in the transaction tree represents a transaction execution. (The transaction nodes have
been labeled so that they may be individually referenced later.) Transaction tree nodes that are
children  of  a  given  transaction  tree  node  represent  transactions  that  are  nested  within  a  parent 
transaction  (that  is,  either  they  are  contained  within  the  parent  transaction’s  definition  or  they  are 
executed  as  a  result  of  invocations  performed  as  part  of  the  parent  transaction).  Children  that  are 
connected  to  their  parent  by  a  single  line  only  are  executed  in  a  serial  order,  with  the  leftmost  child 
being executed first; children that are connected to their parent by a line through which an arc passes
are executed concurrently. (These concurrent executions are typically the result of executing
RequestSingle or RequestAll primitives.)

Each  transaction  has  an  associated  scope.  That  scope  is  delimited  by  the  transaction  definition 
primitive.  Any  statement  that  is  part  of  the  transaction  definition  is  within  the  scope  of  the  transaction, 
as  are  any  statements  that  are  executed  as  a  result  of  arobject  operation  invocations  or  procedure  or 
function  calls  made  by  statements  within  the  transaction’s  scope.  With  respect  to  the  transaction  tree 
representation,  all  of  the  subtree  at  or  beneath  the  node  corresponding  to  a  given  transaction  lies  in 
the  scope  of  that  transaction. 

The  transaction  tree  can  also  be  used  to  explain  the  ArchOS  rules  governing  lock  passing  among 
transactions. As explained in Section 4.2.4.2, the client explicitly obtains the locks that are required
by a transaction. When a nested elementary transaction commits, all of the locks that it held are
passed  to  the  transaction  in  which  it  is  nested  (its  parent  in  the  transaction  tree);  when  a  nested 
compound  transaction  commits,  all  of  its  locks  are  released. 
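The commit-time rule above can be sketched in C as follows. This is only an illustration under stated assumptions: the names Tx, Lock, and tx_commit_locks are hypothetical and are not ArchOS primitives, and a lock is modeled as nothing more than a record of its current holder.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TX_ELEMENTARY, TX_COMPOUND } TxKind;

/* Hypothetical transaction record: only what the rule needs. */
typedef struct Tx {
    struct Tx *parent;   /* NULL for a top-level transaction */
    TxKind     kind;
} Tx;

/* Hypothetical lock record. */
typedef struct {
    const Tx *holder;    /* NULL when the lock is free */
} Lock;

/* Apply the commit-time rule to every lock held by t.
 * Returns the number of locks that were passed on or released. */
size_t tx_commit_locks(const Tx *t, Lock *locks, size_t nlocks)
{
    assert(t != NULL);

    /* Only a nested elementary transaction has an heir (its parent).
     * A top-level elementary transaction has parent == NULL, and a
     * compound transaction always releases, so both yield NULL here. */
    const Tx *heir = (t->kind == TX_ELEMENTARY) ? t->parent : NULL;
    size_t moved = 0;

    for (size_t i = 0; i < nlocks; i++) {
        if (locks[i].holder == t) {
            locks[i].holder = heir;
            moved++;
        }
    }
    return moved;
}
```

Under this model a released lock and a lock inherited by the parent differ only in the resulting holder, which keeps the three cases of the rule in a single expression.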

The  rules  that  determine  which  locks  a  transaction  may  obtain  are  more  involved.  Two  transactions 
are said to be unrelated if: (1) they are not contained in a single transaction tree, or (2) they are in a
single transaction tree and are concurrently executing siblings or descendants of concurrently
executing siblings, or (3) they are in a single transaction tree in which one is an ancestor of the other
and either the descendant is a compound transaction or there is a compound transaction on the path
connecting the two transaction nodes in the corresponding transaction tree. (Note that according to
this definition, a compound child transaction is always unrelated to its parent transaction.) Two
unrelated transactions may compete for locks, and they may hold locks with compatible lock modes
for a single data object at any given time. However, if they request incompatible lock modes for a
single data object, then one of the competitors will obtain a lock and the other will block until it can
receive the desired lock, or it will return to the requestor with an appropriate status indication.

Two transactions are said to be related if they are contained in a single transaction tree where one
is the descendant of the other, the descendant is an elementary transaction, and there are no
compound transactions in the path connecting their respective nodes in the transaction tree. The
lock compatibility rules for related transactions are different from those for unrelated transactions. In
this case, the descendant transaction can obtain any lock mode for any lock held by a related
ancestor in the transaction tree, including incompatible lock modes that would not be allowed if the
transactions  were  unrelated.  (Of  course,  the  descendant  transaction  will  have  to  compete  with  all  of 
the  unrelated  transactions  in  the  system  in  order  to  successfully  obtain  the  requested  lock  with  the 
desired  mode.) 
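The definition of related transactions can be checked mechanically by walking up the transaction tree. The following C sketch is one way to do so; the Tx type and the tx_related function are hypothetical names introduced for illustration. It assumes that the "path connecting" the two nodes includes the descendant but excludes the ancestor, which agrees with the rule that a compound child is always unrelated to its parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TX_ELEMENTARY, TX_COMPOUND } TxKind;

/* Hypothetical transaction tree node. */
typedef struct Tx {
    const struct Tx *parent;   /* NULL for the root of a transaction tree */
    TxKind           kind;
} Tx;

/* True if desc lies strictly below anc and every node from desc up to,
 * but not including, anc is an elementary transaction. */
static bool related_below(const Tx *anc, const Tx *desc)
{
    const Tx *t = desc;

    if (anc == desc)
        return false;
    while (t != NULL && t != anc) {
        if (t->kind == TX_COMPOUND)
            return false;      /* compound descendant or compound on the path */
        t = t->parent;
    }
    return t == anc;           /* NULL here means anc is not an ancestor */
}

/* Two transactions are related if either one is such a descendant
 * of the other. */
bool tx_related(const Tx *a, const Tx *b)
{
    assert(a != NULL && b != NULL);
    return related_below(a, b) || related_below(b, a);
}
```

The sketch deliberately decides only relatedness. Transactions in different trees, and siblings, simply come out as not related, which is sufficient for choosing between the two sets of lock compatibility rules.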

Examples of both related and unrelated transactions can be found in Figure 4-3. For instance,
transaction 'k' is related to transactions 'i' and 'd,' but it is unrelated to transaction 'a' (due to point (3)
in the above definition of unrelated transactions). Also, while transactions 'f,' 'g,' and 'h' are all
related to transaction 'c,' they are all unrelated to one another (due to point (2) in the above definition
of unrelated transactions). Finally, there are only two other related transaction pairs in the figure:
transaction 'i' is related to transaction 'd' and transaction 'b' is related to transaction 'a.'

4.2.5 Real-Time Facilities

As  previously  stated,  ArchOS  is  a  real-time  operating  system;  for  ArchOS,  this  statement  carries  a 
number  of  important  implications.  In  this  work,  we  define  a  real-time  operating  system  to  be  one 
which  manages  its  system  resources  to  meet  user-defined  deadlines.  Processes  will  be  scheduled 
using deadline-driven scheduling with reference to user-defined policies (see Section 4.2.6). This
means that when the system determines that processing resources are sufficient to meet user
deadlines at each node, it will meet them, but when resources are insufficient, user policy will guide
the  operating  system  in  its  decisions  as  to  which  deadlines  should  be  missed  or  whether  some 
processes  should  be  relocated. 

Examples  of  user  policies  to  be  implemented  in  the  event  of  insufficient  resources  include: 

•  Minimize average lateness

•  Minimize maximum lateness

•  Minimize number of late processes

•  Minimize priorities of late processes

In scheduling terminology, these policies actually define objective functions for the resulting
scheduling algorithm(s). Scheduling techniques for some of these objective functions (e.g., minimize
maximum lateness) are well known, while for most others optimum algorithms are known to be
intractable. ArchOS will use best effort decision making to implement policies such as these and
many other [illegible] policies, as closely as possible.
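To make the notion of an objective function concrete, the following C sketch computes three of the quantities named in the policy list for a set of completed processes. It uses the standard scheduling-theory definition of lateness as completion time minus deadline; the Outcome type and the function names are hypothetical and are not part of ArchOS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one finished process. */
typedef struct {
    double completion;   /* time at which the process finished */
    double deadline;     /* time by which it should have finished */
} Outcome;

/* Lateness is negative when the process finished early. */
static double lateness(const Outcome *o)
{
    return o->completion - o->deadline;
}

double average_lateness(const Outcome *o, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += lateness(&o[i]);
    return sum / (double)n;
}

double maximum_lateness(const Outcome *o, size_t n)
{
    assert(n > 0);
    double worst = lateness(&o[0]);
    for (size_t i = 1; i < n; i++) {
        double l = lateness(&o[i]);
        if (l > worst)
            worst = l;
    }
    return worst;
}

/* Number of processes that finished strictly after their deadline. */
size_t late_count(const Outcome *o, size_t n)
{
    size_t late = 0;
    for (size_t i = 0; i < n; i++)
        if (lateness(&o[i]) > 0.0)
            late++;
    return late;
}
```

Each function only evaluates a given schedule; finding the schedule that minimizes the quantity is the separate and generally much harder problem discussed above.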


In  order  to  handle  real-time  constraints,  each  ArchOS  primitive  must  have  its  execution  time 
bounded.  This  bound  may  be  probabilistic,  and  should  be  determined  with  respect  to  the  significant 
events  or  actions  involved  in  fulfilling  the  request  (e.g.,  maximum  number  of  internode  messages, 
maximum  cpu  time  used). 

The  local  scheduling  model  used  by  each  ArchOS  node  to  manage  its  real-time  load  consists  of  a 
set of n active processes p_i, i between 1 and n. For each p_i, a stochastic execution time C_i and a
deadline T_i are known. C_i is a value estimated by the process' implementer and measured by the
system during execution. As a part of the set of user controllable policies described above for
handling missed deadlines, a value function will be defined for each processor to determine the
value to the system associated with achieving a particular degree of lateness for process p_i. The
value  functions  themselves  are  either  chosen  by  ArchOS  with  reference  to  the  user  policies,  or  may 
be  provided  directly  by  the  user  policies,  thus  creating  an  extremely  large  set  of  potential  overload 
policies. 
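A value function of this kind can be represented as a function from lateness to value, and a candidate schedule can then be scored by summing the values of its processes. The C sketch below illustrates the idea only. The two value functions, the ten-unit decay window, and all names are assumptions made for the example, not functions defined by ArchOS.

```c
#include <assert.h>
#include <stddef.h>

/* A value function maps a degree of lateness to a value to the system. */
typedef double (*ValueFn)(double lateness);

/* Hypothetical hard-deadline value function: full value on time, none late. */
double step_value(double lateness)
{
    return lateness <= 0.0 ? 1.0 : 0.0;
}

/* Hypothetical soft-deadline value function: value decays linearly to
 * zero over 10 time units of lateness (the window is an assumption). */
double decaying_value(double lateness)
{
    if (lateness <= 0.0)
        return 1.0;
    if (lateness >= 10.0)
        return 0.0;
    return 1.0 - lateness / 10.0;
}

/* One process in a candidate schedule. */
typedef struct {
    double  lateness;   /* predicted completion time minus deadline */
    ValueFn value;      /* value function chosen for this process */
} Candidate;

/* Total value of a candidate schedule: the sum of each process's value. */
double schedule_value(const Candidate *c, size_t n)
{
    assert(c != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += c[i].value(c[i].lateness);
    return total;
}
```

A scheduler following this model would compare schedule_value across the candidate schedules it is able to examine and prefer the largest; how the candidates are generated is outside the scope of the sketch.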

The  system  will  continuously  monitor  its  performance  with  respect  to  the  likelihood  of  missing 
currently  known  deadlines,  using  a  best  effort  to  arrange  scheduling  to  distribute  the  lateness  should 
missed  deadlines  occur.  This  processing  is  currently  a  critical  portion  of  the  Archons  research  effort, 
and  the  results  of  this  research  will  be  directly  applied  to  this  scheduling. 

During  process  scheduling  at  a  single  node,  an  overload  condition  may  be  detected  in  which 
deadlines  may  be  missed.  The  scheduling  algorithms,  under  control  of  the  appropriate  policies,  may 
determine a (sub)set of processes which should be removed from the local node and moved
elsewhere. The decision of where they should be moved will be made by an algorithm to be
developed,  also  under  control  of  the  application  defined  policy. 


Primitives are available for specification of periodic process execution, user specified delay,
real-time clock management, and lateness doctrine policy specification. With respect to the ArchOS
primitives, deadlines are defined by adding the client-defined deadline interval (see Delay and Alarm
primitives, Section 4.3.9) to the request time, which is defined to be: (1) the scheduled periodic
process execution time, (2) the expiration time of a Delay or Alarm primitive currently in progress, (3)
the time at which a new process becomes ready for execution following a CreateProcess primitive.
Processes for which deadlines have not been defined, if any, will be scheduled with the objective of
maximum throughput subject to the constraint that deadlines will first be met for processes with
deadlines.
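The deadline rule above amounts to a single addition, and the constraint on processes without deadlines can be expressed as an ordering. The C sketch below shows both. The Proc type and function names are hypothetical, and ordering deadline-bearing processes by earliest deadline is an assumption made for the example; the report does not commit ArchOS to that particular algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical scheduling record for one process. */
typedef struct {
    bool   has_deadline;
    double request_time;        /* periodic start, Delay/Alarm expiry, or
                                   ready time after CreateProcess */
    double deadline_interval;   /* client-defined; unused without a deadline */
} Proc;

/* Deadline = request time + client-defined deadline interval. */
double proc_deadline(const Proc *p)
{
    assert(p != NULL && p->has_deadline);
    return p->request_time + p->deadline_interval;
}

/* qsort comparator: processes with deadlines come first, ordered by
 * earliest deadline (an assumption); processes without deadlines follow. */
int by_deadline(const void *a, const void *b)
{
    const Proc *x = a;
    const Proc *y = b;

    if (x->has_deadline != y->has_deadline)
        return x->has_deadline ? -1 : 1;
    if (!x->has_deadline)
        return 0;

    double dx = proc_deadline(x);
    double dy = proc_deadline(y);
    return (dx > dy) - (dx < dy);
}
```

Sorting a run queue with qsort and by_deadline places every process that has a deadline ahead of those that do not, which reflects the stated constraint without prescribing how throughput is maximized among the remainder.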



4.2.6  Policy  Definitions 

ArchOS  exists  to  support  a  distributed  (application)  program.  In  order  to  provide  a  flexible 
environment  that  the  program  writer  can  tailor  to  satisfy  specific  application-dependent  requirements, 
ArchOS  allows  the  client  to  define  the  policies  used  to  manage  certain  system  resources. 

In  general,  the  ArchOS  client  may  dynamically  specify  the  policy  to  be  followed  with  respect  to  the 
management  of  a  specific  resource.  For  instance,  a  process  scheduling  policy  may  specify  the 
manner  in  which  processor  time  is  managed.  In  order  to  support  a  wide  range  of  alternative  policies, 
some  of  which  may  be  defined  by  the  client,  ArchOS  will  include  a  set  of  mechanisms  that  will  be 
combined  as  appropriate  to  supply  the  foundation  for  the  facility  that  the  client  has  selected. 

The  Archons  researchers  are  interested  in  studying  the  specification  of  policy  by  the  application 
programmer,  as  well  as  the  separation  of  policy  and  mechanism  within  a  given  resource  management 
facility.  In  ArchOS,  the  integration  of  client-selected  policies  into  the  system  will  be  studied  in  two 
areas:  process  scheduling  policies  and  process  reconfiguration  policies.  Process  scheduling 
policies  are  discussed  at  some  length  in  this  section  to  provide  examples  for  the  ArchOS  policy 
definition  facility  since  these  ideas  have  been  actively  pursued  by  Archons  researchers;  process 
reconfiguration policies, which govern the dynamic migration of processes from node to node, are
not as well developed and are an area of future research.

Consider the definition of a process scheduling policy by the client. There are two different
situations under which processes are scheduled: the situation in which there are sufficient
processing resources to satisfy all of the client-specified service requests, and the situation in which
the  demand  for  processing  resources  is  greater  than  their  supply.  At  this  time,  ArchOS  policy 
management  is  only  concerned  with  the  latter  case,  and  this  is  reflected  in  the  support  provided  for 
client-specified  scheduling  policies. 

In the case where the demand for processing resources is greater than their supply, the client has
two options: the client may issue a general policy statement which corresponds to a pre-defined
scheduling policy, or the client may define a new policy which adheres to the general model that the
mechanisms underlying the ArchOS process scheduling facility support. These mechanisms are
aimed at optimizing the value of a function, known as the objective function. In ArchOS, the objective
function will maximize the sum of a value function evaluated for each runnable process on a given
node. Whenever the client selects a pre-defined scheduling policy, ArchOS will transform this
selection into the appropriate value (and hence, objective) function; in situations where the client
wants to use a scheduling policy that is not pre-defined, the client must explicitly
specify the value function to be used. (The client accomplishes this by means of specifying the
parameters for a value function. Value functions are also discussed in Section 4.2.5.)

The  set  of  pre-defined  scheduling  policies  which  may  be  selected  when  the  demand  for  processing 
resources  exceeds  the  supply  includes  the  following  examples  (See  Section  4.2.5  for  a  discussion  of 
process  deadlines  and  real-time  considerations,  in  general.): 

•  minimizing the number of deadlines missed;

•  minimizing  the  average  lateness; 

•  minimizing  the  maximum  lateness; 

•  maximizing  the  minimum  lateness; 

•  minimizing  the  weighted  tardiness.  (Tardiness  is  defined  to  be  non-negative  lateness, 
and  the  weighting  factor  can  incorporate  process  or  arobject  priorities  into  the 
scheduling  computation.) 
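The last policy in the list can be stated as a formula: the sum over all processes of weight times tardiness, where tardiness is max(0, lateness). A brief C sketch follows; the Job type and the function names are hypothetical, and the weight stands in for whatever process or arobject priority the client supplies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one finished process. */
typedef struct {
    double completion;   /* time at which the process finished */
    double deadline;     /* time by which it should have finished */
    double weight;       /* e.g. derived from process or arobject priority */
} Job;

/* Tardiness is non-negative lateness: early completion counts as zero. */
double tardiness(const Job *j)
{
    double l = j->completion - j->deadline;
    return l > 0.0 ? l : 0.0;
}

/* Weighted tardiness of a whole schedule. */
double weighted_tardiness(const Job *jobs, size_t n)
{
    assert(jobs != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += jobs[i].weight * tardiness(&jobs[i]);
    return total;
}
```

Because early completions contribute nothing, this measure cannot be improved by finishing one process early at the expense of another, which is the practical difference from the average-lateness policy.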

The  implementation  of  client  policy  definition  in  ArchOS  will  include  the  notions  of  policy  names  and 
a corresponding set of policy modules. For example, Schedule and Reconfigure will be recognized
by the system as policy names and their bodies can be kept as a separate policy module. All of the
client-defined policy modules are maintained by a policy set arobject, which will be instantiated by the
application  program.  Thus,  it  will  also  be  possible  to  add,  delete,  or  change  the  existing  policy 
module  for  a  given  policy  by  invoking  a  corresponding  operation  of  the  policy  set  arobject  during  run 
time. 
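The add, delete, and change operations of a policy set can be pictured as operations on a small table keyed by policy name. The C sketch below is a model only: the PolicySet layout, the fixed capacity, the reduction of a policy module to a single lateness doctrine, and every function name are assumptions made for illustration, not the interface of the policy set arobject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MODULES 4    /* assumed capacity */
#define NAME_LEN    16   /* assumed maximum policy name length, with NUL */

typedef enum {
    MIN_AVERAGE_LATENESS,
    MIN_MAXIMUM_LATENESS,
    MIN_MISSED_DEADLINES
} LatenessDoctrine;

/* Hypothetical policy module: a name plus one attribute. */
typedef struct {
    bool             used;
    char             name[NAME_LEN];
    LatenessDoctrine doctrine;
} PolicyModule;

typedef struct {
    PolicyModule slot[MAX_MODULES];
} PolicySet;

/* Look up the module currently installed for a policy name, or NULL. */
PolicyModule *policy_find(PolicySet *s, const char *name)
{
    assert(s != NULL && name != NULL);
    for (size_t i = 0; i < MAX_MODULES; i++)
        if (s->slot[i].used && strcmp(s->slot[i].name, name) == 0)
            return &s->slot[i];
    return NULL;
}

/* Add a module for a new policy name, or change the existing one. */
bool policy_put(PolicySet *s, const char *name, LatenessDoctrine d)
{
    PolicyModule *m = policy_find(s, name);

    if (m == NULL) {
        if (strlen(name) >= NAME_LEN)
            return false;                     /* name does not fit */
        for (size_t i = 0; i < MAX_MODULES && m == NULL; i++)
            if (!s->slot[i].used)
                m = &s->slot[i];
        if (m == NULL)
            return false;                     /* policy set is full */
        strcpy(m->name, name);
        m->used = true;
    }
    m->doctrine = d;
    return true;
}

/* Delete the module for a policy name; false if none is installed. */
bool policy_remove(PolicySet *s, const char *name)
{
    PolicyModule *m = policy_find(s, name);
    if (m == NULL)
        return false;
    m->used = false;
    return true;
}
```

Keeping the table behind three small operations mirrors the description above, in which the application changes policy at run time by invoking operations on the policy set rather than by touching scheduler state directly.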

Each  policy  module  can  be  built  by  referring  to  a  set  of  policy  attributes.  These  attributes  are  used 
to  determine  the  policy  to  be  carried  out  by  ArchOS  or  to  pass  information  to  the  policy  module  so 
that it can make a particular policy decision. For instance, the arobject priority and lateness doctrine
parameters listed below could be used by ArchOS to translate the client's policy desires into an
objective function to handle process scheduling.

Inclusion of a policy set arobject in an application system will result in its INITIAL process being
scheduled during system initialization. This arobject may then define a set of policy attributes for this
program. Some examples of attributes that might be used for the Schedule or Reconfigure policies
are: 

•  Arobject priority - relative scheduling priority of instances of one arobject versus
instances of other arobjects in the application program during periods when there are not
enough computing resources to satisfy all demands. It should be noted that in the event
of  more  than  one  application  program  running  on  the  ArchOS  system,  no  relative 
priorities  may  be  determined  between  them,  so  performance  of  either  or  both  may  be 
adversely affected. (As stated in Section 4.2.1.1, ArchOS is designed expressly to
support  a  single  program,  or  set  of  arobjects.  This  set  of  arobjects  may  well  perform 
computations  that  would  normally  be  associated  with  a  number  of  separate  "tasks." 
However,  as  far  as  ArchOS  is  concerned,  this  collection  of  tasks  comprises  a  single 
distributed  program  since  all  of  the  relative  priorities  among  the  component  arobjects  are 
defined.) 

•  Lateness  doctrine  -  rules  to  be  followed  in  the  event  that  deadlines  must  be  missed.  This 
doctrine  indicates  the  nature  of  processing  which  will  characterize  degraded  modes, 
including  such  possibilities  as  maximizing  the  minimum  lateness,  minimizing  the 
maximum  lateness,  minimizing  the  average  lateness,  etc.  Execution  of  these  doctrines 
will  be  done  on  a  best  effort  basis  utilizing  the  available  information,  even  though 
incomplete  or  inaccurate,  maximizing  its  value,  and  making  the  decision  as  near  optimal 
as  possible,  consistent  with  the  application  performance  requirements. 


4.3  ArchOS  Primitives 

ArchOS primitives are defined as a set of operations which manipulate various system and kernel
arobjects. The primitives can be classified as system and kernel primitives. The system primitives are
provided by system arobjects which exist above the kernel while the kernel primitives are defined
within a set of kernel arobjects. In this section, we will specify all of the ArchOS primitives (both
kernel  and  system)  which  are  visible  to  a  client.  Other  kernel  primitives  will  be  described  in  the 
ArchOS  System  Architecture  Specification. 

4.3.1  Specification  of  Primitives 

We  will  use  the  following  format  to  describe  the  specification  of  ArchOS  primitives.  The 
specification  of  a  primitive  consists  of  two  parts.  The  first  part  describes  the  functionality  of  the 
primitive,  and  the  second  part  defines  the  actual  interface  for  a  client  process  by  using  a  syntactic 
template shown below.

It should be noted that since we adopted a C-like syntax in this document, many styles of type
declarations are adapted from C [Kernighan 78]. For instance, a pointer called "ptr" to a type, say
"TYPE_1", will be specified as "TYPE_1 *ptr". An optional argument, say arg_i, of a primitive will be
indicated by [, arg_i] in its argument list.


val_0 = Primitive(arg_1, ..., arg_k)


TYPE_0 val_0  This indicates that the "val_0" has the type "TYPE_0". It also includes the type
definition of TYPE_0 if necessary.

TYPE_1 arg_1  This line specifies the type of argument_1.

TYPE_k arg_k  This line specifies the type of argument_k.


The first line in this template shows that an ArchOS primitive, called Primitive, requires k
arguments, such as arg_1, ..., arg_k, and returns a value.

The additional part may consist of error or abnormal conditions related to the above primitive
invocation  and  may  have  the  following  explanations. 

On  Error:  This  part  explains  the  types  of  errors  that  can  occur  and  what  values  will  be  returned  in 
response  to  an  error.  In  general,  a  client  can  get  detailed  error  information  by  looking  at  a  specific 
area  called  an  error  block.  The  error  block  must  be  declared  by  using  a  SetErrorBlock  primitive  (see 
Section  4.3.11.5). 

On  Timeout:  If  a  primitive  has  a  timeout  argument,  then  this  may  explain  what  happens  after  a 
timeout  occurs. 

This  may  also  contain  extra  comments  such  as  follows. 

Note:  This  is  an  example  of  additional  comments  on  this  primitive. 

4.3.2 Arobject/Process Management

The arobject and process management provides creation and destruction of arobject instances as
well as process instances. The CreateArobject and CreateProcess primitives create an instance of an
arobject and a process respectively. Similarly, the KillArobject and KillProcess primitives kill a specific
instance.

The lifetime of an arobject instance can vary; it is determined by the lifetime of the last active process
instance within that arobject instance. In other words, if an instance of an arobject is killed, all of its
internal processes will be halted and removed. A process may terminate by normal exit (i.e., reaching
the end of its body) or by an explicit KillProcess primitive.


4.3.2.1 Create

A  CreateArobject  primitive  creates  a  new  instance  of  an  arobject  at  an  arbitrary  node  or  a  specified 
node. Similarly, a CreateProcess primitive creates a new instance of a process in the arobject. The
selection  of  a  node  is  made  automatically  by  ArchOS  unless  overridden  by  the  Create  operation. 
Upon  arobject  creation,  the  arobject's  INITIAL  process  is  automatically  dispatched. 

An  optional  set  of  parameters  can  be  passed  to  the  INITIAL  process  when  the  arobject  is 
instantiated  by  using  an  initial  message  (i.e.,  "init-msg"). 


arobject-id = CreateArobject(arobj-name [, init-msg] [, node-id])
process-id  =  CreateProcess(process-name[,  init-msg]  [,  node-id]) 


AID  arobject-id  The  unique  identification  of  the  instantiated  arobject. 

PID  process-id  The  unique  identification  of  the  instantiated  process. 

AROBJ-NAME  arobj-name 

The  name  of  arobject  to  be  instantiated. 

PROCESS-NAME process-name

The  name  of  process  to  be  instantiated. 

MESSAGE  *init-msg 

A  pointer  to  the  initial  message  which  contains  initial  parameters  for  the  INITIAL 
process. 

NODE-ID  node-id  Node  identification  (optional).  An  actual  node  may  be  designated,  or  a  node 
selection  criterion  may  be  designated  (e.g.,  the  current  node,  any  node  except  the 
current  node,  any  node,  or  a  specific  node). 


On Error: If a CreateArobject or CreateProcess primitive fails, a "NULL-AID" or "NULL-PID" will be
returned, respectively. The detailed error code can be found in an error block which is defined by
issuing a SetErrorBlock primitive (see Section 4.3.11.5).

4.3.2.2 Kill

The KillArobject and KillProcess primitives remove an arobject and a process instance respectively. An
arobject may be killed only by one of its own processes (suicide allowed, no murder). In order to kill
another arobject, the target arobject must have an appropriate operation defined within its
specification so it can kill itself.
A  process  can  be  killed  only  by  a  process  which  exists  in  the  same  arobject  instance. 

val = KillArobject(aid)
val  =  KillProcess(pid) 

BOOLEAN  val  TRUE  if  this  killing  was  successful;  otherwise  FALSE. 

AID  aid  The  arobject  id  of  the  target  arobject  to  be  killed. 

PID  pid  The  process  id  of  the  target  process  to  be  killed. 


On  Error:  If  the  specified  arobject  does  not  exist  in  the  system,  or  a  target  process  does  not  exist  in 
the  requestor’s  arobject  instance,  a  "FALSE"  value  will  be  returned. 
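The "suicide allowed, no murder" rule can be modeled with a table that records which arobject instance each process belongs to. The C sketch below is such a model. It is not the ArchOS implementation: the Proc table, the use of table indices as process ids, the NULL_AID constant standing in for "NULL-AID", and the function names are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NULL_AID 0   /* stand-in for "NULL-AID"; marks a dead process slot */

/* Hypothetical process table entry. */
typedef struct {
    int aid;         /* arobject instance the process lives in */
} Proc;

/* A process may be killed only by a live process in the same arobject. */
bool model_kill_process(Proc *procs, size_t n, size_t requester, size_t target)
{
    assert(procs != NULL);
    if (requester >= n || target >= n)
        return false;
    if (procs[requester].aid == NULL_AID || procs[target].aid == NULL_AID)
        return false;                       /* one of them does not exist */
    if (procs[requester].aid != procs[target].aid)
        return false;                       /* target is in another arobject */
    procs[target].aid = NULL_AID;
    return true;
}

/* An arobject may be killed only by one of its own processes; killing it
 * halts and removes every process inside it. */
bool model_kill_arobject(Proc *procs, size_t n, size_t requester, int aid)
{
    assert(procs != NULL);
    if (requester >= n || aid == NULL_AID || procs[requester].aid != aid)
        return false;
    for (size_t i = 0; i < n; i++)
        if (procs[i].aid == aid)
            procs[i].aid = NULL_AID;
    return true;
}
```

To remove a different arobject under this rule, a client would invoke an operation on that arobject which in turn kills its own instance, as described above.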

4.3.2.3 SelfID and ParentID

The  SelfAid  primitive  returns  the  requestor's  arobject  id  and  the  ParentAid  primitive  returns  the 
parent’s  arobject  id  of  the  specified  arobject.  The  SelfPid  primitive  returns  the  process  id  (pid)  of  the 
requestor and the ParentPid primitive returns the parent's pid of the specified process.


aid = SelfAid()
paid = ParentAid(aid-x)
pid = SelfPid()
ppid = ParentPid(pid-x)


AID  aid,  aid-x,  paid  The  arobject  id. 
PID pid, pid-x, ppid  The process id.


On Error: If a non-existing aid or pid is given to ParentAid/Pid, a "NULL-AID" or "NULL-PID" will be
returned.

4.3.2.4 BindName

Any arobject or process can have a (run time) reference name defined by these binding primitives
within a single distributed program. The BindArobjectName and BindProcessName primitives bind
the requested instance of an arobject or process to a reference name. Unless an Unbind primitive is
used, the lifetime of a binding is the same as the lifetime of an arobject or process instance.

This binding allows an arobject or a process to have more than one reference name, or a single
reference name can be bound to multiple arobject or process instances.

To  cancel  the  current  binding,  a  process  must  use  the  appropriate  unbind  primitive. 

val = BindArobjectName(aid, arobj-refname)
val = BindProcessName(pid, process-refname)

BOOLEAN  val  TRUE  if  this  binding  was  successful. 

AID  aid  The  arobject  id. 

PID  pid  The  process  id. 

AROBJ-REFNAME  arobj-refname 

The  requested  reference  name  for  an  arobject  given  by  aid. 

PROCESS-REFNAME  process-refname 

The  requested  reference  name  for  a  process  given  by  pid. 

On  Error:  A  "FALSE"  value  will  be  returned  in  the  case  of  an  error.  For  instance,  if  a  client  attempts 
to bind a non-existent arobject or process instance to a reference name, a "FALSE" value will be
returned. 

4.3.2.5 UnbindName

The UnbindArobjectName and UnbindProcessName primitives release the current binding between
the specified instance of an arobject or process and a reference name.

val = UnbindArobjectName(aid, arobj-refname)
val = UnbindProcessName(pid, process-refname)

BOOLEAN  val  TRUE  if  this  unbinding  was  successful;  otherwise  FALSE. 

AID  aid  The  arobject  id. 

PID pid  The process id.

AROBJ-REFNAME arobj-refname

The requested reference name for an arobject given by aid.

PROCESS-REFNAME process-refname

The requested reference name for a process given by pid.


On Error: A "FALSE" value will be returned if a client attempts to release the binding for a
non-existing  reference  name. 

4.3.2.6 FindID and FindAllID

A FindID primitive returns the unique id (i.e., aid or pid) of the given arobject or process in a specific
search domain. A search domain can be specified with respect to all of the internal arobjects,
external arobjects, a local node, a remote node, or a reasonable combination of these four. If more
than one instance uses the same reference name, the unique id of any one of them will be returned. A
FindAllID primitive, on the other hand, returns all of the aid's and pid's which correspond to the given
reference name.


aid = FindAid(arobj-refname [, preference])
pid = FindPid(process-refname [, preference])
aid-list = FindAllAid(arobj-refname [, preference])
pid-list = FindAllPid(process-refname [, preference])


AID  aid  The  arobject  id. 

PID  pid  The  process  id. 

AID-LIST  aid-list  The  list  of  corresponding  aid’s. 

PID-LIST  pid-list  The  list  of  corresponding  pid’s. 

AROBJ-REFNAME arobj-refname

The reference name of related arobject(s).

PROCESS-REFNAME process-refname

The reference name of related process(es).

PREFERENCE preference

The preference can specify a search domain such as "INTERNAL",
"EXTERNAL", "LOCAL", "REMOTE", "INTERNAL-LOCAL",
"INTERNAL-REMOTE", "EXTERNAL-LOCAL", "EXTERNAL-REMOTE".


On  Error:  If  the  FindAid  or  FindPid  primitive  fails,  a  "NULL-AID"  or  "NULL-PID"  will  be  returned 
respectively. If a FindAll primitive fails, a "NULL-AID-LIST" or "NULL-PID-LIST" will be returned.
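The relationship between binding and lookup, including the case in which one reference name is bound to several instances, can be modeled with a simple table of (name, id) pairs. The following C sketch is illustrative only. The NameTable type, its capacity, the NULL_AID constant, and the function names are assumptions, and the search-domain preference argument is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NULL_AID     0    /* stand-in for "NULL-AID" */
#define MAX_BINDINGS 16   /* assumed capacity */

/* One (reference name, arobject id) pair. The table stores the pointer,
 * so the caller must keep the name string alive. */
typedef struct {
    const char *refname;
    int         aid;
} Binding;

typedef struct {
    Binding b[MAX_BINDINGS];
    size_t  count;
} NameTable;

/* Bind an instance to a reference name. Many-to-many bindings are allowed. */
bool bind_name(NameTable *t, int aid, const char *refname)
{
    assert(t != NULL && refname != NULL);
    if (aid == NULL_AID || t->count == MAX_BINDINGS)
        return false;
    t->b[t->count].refname = refname;
    t->b[t->count].aid = aid;
    t->count++;
    return true;
}

/* Return the id of any one instance bound to refname, or NULL_AID. */
int find_aid(const NameTable *t, const char *refname)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->b[i].refname, refname) == 0)
            return t->b[i].aid;
    return NULL_AID;
}

/* Copy up to cap matching ids into out; return the total number of matches. */
size_t find_all_aid(const NameTable *t, const char *refname,
                    int *out, size_t cap)
{
    size_t found = 0;
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->b[i].refname, refname) == 0) {
            if (found < cap)
                out[found] = t->b[i].aid;
            found++;
        }
    }
    return found;
}
```

Returning the total number of matches from find_all_aid, rather than the number copied, lets a caller detect that its buffer was too small; this is a design choice of the sketch, not something the report specifies.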

4.3.3  Communication  Management 

The  communication  management  provides  an  inter-  and  intra-node  communication  facility  among 
cooperating  arobjects.  A  process  can  invoke  an  operation  at  a  specific  instance  of  an  arobject  by 
sending  a  request  message  or  can  invoke  the  same  operation  at  multiple  instances  of  an  arobject 
which have been bound to a single reference name. In the latter case, the multiple computations are
performed  concurrently. 

In  particular,  a  process  can  send  any  arobject  instance  a  request  message  to  invoke  an  operation 
without  knowing  its  actual  location.  A  process  can  also  invoke  a  private  operation  which  is  defined 
within  its  local  arobject. 

There  are  essentially  three  types  of  primitives  to  provide  flexible  cooperation  among  arobjects: 
Request,  Accept,  and  Reply.  In  addition  to  these,  the  RequestAll,  RequestSingle,  and  GetReply 
primitives  are  added  in  order  to  interact  with  multiple  instances  of  arobjects  concurrently.  All  of  the 
communication  primitives  are  executed  as  transactions  so  that  ArchOS  can  provide  properties  such 
as failure atomicity and permanence (see Section 4.3.6).

A  message  consists  of  a  header  and  a  body.  The  message  header  will  be  generated  by  ArchOS  and 
contains  control  information.  The  message  body  carries  all  of  the  parameters  and  will  be  set  by  the 
client  process.  Since  each  arobject  has  a  separate  address  space,  parameters  must  be  sent  using 
call-by-value  semantics. 
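
The header/body split and the call-by-value rule can be illustrated with a minimal Python sketch. The class and field names below (Header, Message, make_message, and the sample arobject names) are hypothetical and not part of ArchOS; the deep copy stands in for copying parameters between separate address spaces.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Header:
    # Control information; in ArchOS this part is generated by the system.
    sender: str
    receiver: str
    operation: str


@dataclass
class Message:
    header: Header
    body: dict = field(default_factory=dict)


def make_message(sender, receiver, operation, params):
    """Build a message whose body is a deep copy of the caller's parameters."""
    return Message(Header(sender, receiver, operation), copy.deepcopy(params))


params = {"readings": [1, 2, 3]}
msg = make_message("sensor-1", "tracker-1", "Update", params)
params["readings"].append(4)   # a later change by the sender ...
print(msg.body["readings"])    # ... does not reach the message body
```

Because the body is copied, the receiver never sees a reference into the sender's data, which is the property the call-by-value requirement is meant to guarantee.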

4.3.3.1  Request

The Request primitive provides remote procedure call semantics in which the requesting process
invokes an operation by sending a message and blocks until the receiving arobject returns a reply
message. Identically, if the receiver arobject is the same arobject, a local operation will be invoked.
When the reply is generated, it is sent to the requestor, which is then unblocked. If the requesting
process needs to limit the allowed response time, it may do so by creating a small process to handle
the timeout condition.


trans-id = Request(arobj-id, opr, msg, reply-msg)


TRANSACTION-ID  trans-id

The id of the transaction on whose behalf the request is being made.


AID  arobj-id  The  unique  id  of  the  receiving  arobject. 

OPE-SELECTOR  opr

The  name  of  the  operation  to  be  performed. 

MESSAGE  *msg  A  pointer  to  the  message  which  contains  the  parameters  of  the  operation  to  be 
performed.  The  message  to  the  destination  arobject  must  not  contain  any 
pointers  (i.e.,  call-by-value  semantics  must  be  used). 

REPLY-MSG  *reply-msg 

A  pointer  to  the  reply  message. 


On Error: The Request primitive may fail in the following situations:

•  The  destination  arobject's  node  is  not  available. 

•  The  destination  arobject  does  not  exist. 

•  The  target  operation  is  not  defined. 

•  The  request  message  does  not  match  the  receiver’s  message  type. 

If  the  Request  primitive  fails,  a  "NULL-TID"  will  be  returned. 
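
The blocking behaviour of Request can be sketched in Python as follows. This is only an illustration under stated assumptions: the Arobject class, the request function, and the use of None for "NULL-TID" are hypothetical, and the optional timeout uses a queue timeout rather than the small helper process suggested in the text.

```python
import itertools
import queue
import threading

NULL_TID = None                 # stand-in for "NULL-TID"
_tids = itertools.count(1)


class Arobject:
    """Toy receiver: one thread accepts each request and replies to it."""

    def __init__(self, operations):
        self.operations = operations      # operation name -> function(msg)
        self.requests = queue.Queue()     # the arobject's request queue
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            opr, msg, reply_box = self.requests.get()   # plays the role of Accept
            reply_box.put(self.operations[opr](msg))    # plays the role of Reply


def request(arobj, opr, msg, timeout=None):
    """Send a request and block until the reply arrives.

    Returns (tid, reply), or (NULL_TID, None) on failure or timeout.
    """
    if arobj is None or opr not in arobj.operations:
        return NULL_TID, None             # no such arobject or operation
    reply_box = queue.Queue(maxsize=1)
    arobj.requests.put((opr, msg, reply_box))
    try:
        reply = reply_box.get(timeout=timeout)   # the requestor blocks here
    except queue.Empty:
        return NULL_TID, None
    return next(_tids), reply


adder = Arobject({"Add": lambda m: m["a"] + m["b"]})
tid, reply = request(adder, "Add", {"a": 2, "b": 3})
```

The sketch checks only two of the listed failure cases (missing arobject, undefined operation); node availability and message type checking are outside its scope.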

4.3.3.2  RequestSingle and RequestAll

The RequestSingle and RequestAll primitives send a request message and proceed without
waiting for a reply message. The RequestSingle primitive provides nonblocking one-to-one
communication, and the RequestAll primitive supports one-to-many communication. The requesting
process may thus invoke an operation on more than one instance of an arobject or process with one
request. To receive all of the replies, the GetReply primitive may be repeated until a reply with a null
body is received.


trans-id = RequestSingle(arobj-id, opr, msg)
trans-id = RequestAll(arobj-refname, opr, msg)


TRANSACTION-ID  trans-id

The transaction ID of the transaction on whose behalf the request is being made.

AID  arobj-id  The ID of the receiving arobject.


AROBJ-REFNAME  arobj-refname

The reference name of the receiving arobject(s).




OPE-SELECTOR  opr

The  name  of  the  operation  to  be  performed. 

MESSAGE  *msg  A  pointer  to  the  message  which  contains  the  parameters  of  the  operation  to  be 
performed.  The  message  to  the  destination  arobject  must  not  contain  any 
pointers (i.e., call-by-value semantics must be used).


On Error: The RequestSingle and RequestAll primitives fail in situations similar to those listed in
the previous section. If the RequestSingle or RequestAll primitive fails, a "NULL-TID" will be
returned.

4.3.3.3  GetReply

The GetReply primitive receives a reply message which has the specific transaction id generated by
the preceding RequestSingle or RequestAll primitive. If the specific reply message is not available,
then the caller will be blocked until the message becomes available.


aid = GetReply(req-trans-id, reply-msg)


AID  aid  The  arobject  id  of  the  replying  arobject. 

TRANSACTION-ID  req-trans-id 

The  transaction  id  of  the  corresponding  RequestSingle  or  RequestAII  primitive. 

REPLY-MSG  *reply-msg 

A  pointer  to  the  reply  message. 


On Error: The GetReply primitive fails if the specified transaction does not exist; a "NULL-AID" will
then be returned.
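
The one-to-many pattern of RequestAll followed by repeated GetReply calls can be sketched in Python. All names here (request_all, get_reply, the trackers dictionary) are hypothetical, None stands in for both "NULL-AID" and a null reply body, and a thread pool is assumed as a substitute for concurrently executing arobject instances.

```python
import itertools
import queue
from concurrent.futures import ThreadPoolExecutor

NULL_AID = None
_tids = itertools.count(1)
_pool = ThreadPoolExecutor(max_workers=4)
_pending = {}   # trans-id -> [reply queue, number of replies still expected]


def request_all(instances, opr, msg):
    """Invoke opr on every instance bound to one reference name; do not wait."""
    tid = next(_tids)
    replies = queue.Queue()
    _pending[tid] = [replies, len(instances)]
    for aid, operations in instances.items():
        _pool.submit(lambda a=aid, f=operations[opr]: replies.put((a, f(msg))))
    return tid


def get_reply(tid):
    """Return (aid, body); a None body means no further replies exist."""
    if tid not in _pending:
        return NULL_AID, None          # unknown transaction
    replies, remaining = _pending[tid]
    if remaining == 0:
        return NULL_AID, None          # null body: every reply was consumed
    _pending[tid][1] -= 1
    return replies.get()               # blocks until a reply is available


trackers = {"tracker-1": {"Scale": lambda m: m * 10},
            "tracker-2": {"Scale": lambda m: m * 100}}
tid = request_all(trackers, "Scale", 3)
results = {}
while True:
    aid, body = get_reply(tid)
    if body is None:
        break
    results[aid] = body
```

The loop at the end mirrors the rule stated for RequestAll: GetReply is repeated until a reply with a null body is received.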

4.3.3.4  Accept and AcceptAny

The process responsible for an arobject operation receives a message using the Accept primitive.
Using the selection criteria specified, the operating system selects an eligible message from the
arobject's input queue and returns it. The process operates on the message, responding with a reply
when processing has been completed. If no suitable message is in the request message queue, then
the caller will block until such a message becomes available.

The AcceptAny primitive can receive a message from any arobject instance with any operation (i.e.,
"ANYOPR") or a specified operation request. The primitive returns the requestor's transaction id,
the specified operator, and the requestor's aid. The Accept primitive can receive a message from a
specific requestor arobject and returns the requestor's transaction id and the requested operator.

(req-trans-id, req-opr, requestor) = AcceptAny(opr, msg)

(req-trans-id, req-opr) = Accept(requestor, opr, msg)

AID  requestor  The  aid  of  the  requesting  arobject. 

OPE-SELECTOR  opr,  req-opr 

The name of the operation to be performed. The "opr" parameter can be a specific
operation name or "ANYOPR".

TRANSACTION-ID  req-trans-id 

The  transaction  id  of  the  transaction  on  whose  behalf  the  request  is  made. 
MESSAGE  *msg  A  pointer  to  the  message  buffer. 


On  Error:  The  AcceptAny  primitive  fails  if  the  specified  operation  does  not  exist.  Similarly,  the 
Accept  primitive  fails  if  the  requestor  or  the  operation  does  not  exist.  If  the  AcceptAny  or  Accept 
primitive  fails,  a  "NULL-TID",  "NULL-OPR",  and  "NULL-AID"  will  be  returned. 
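
The message selection performed by Accept and AcceptAny can be sketched as a filter over the arobject's input queue. The function name select_message and the tuple layout are hypothetical; where the real primitives would block the caller, this sketch simply returns None.

```python
from collections import deque

ANYOPR = "ANYOPR"


def select_message(input_queue, opr=ANYOPR, requestor=None):
    """Remove and return the first queued request matching the criteria.

    requestor=None models AcceptAny; a specific aid models Accept.
    Returns None where the real primitive would block the caller.
    """
    for i, (tid, req_opr, aid, msg) in enumerate(input_queue):
        if opr in (ANYOPR, req_opr) and requestor in (None, aid):
            del input_queue[i]
            return tid, req_opr, aid, msg
    return None


# Each entry: (requestor's transaction id, operation, requestor aid, message)
inbox = deque([(1, "Update", "a1", {"v": 1}),
               (2, "Reset", "a2", {}),
               (3, "Update", "a2", {"v": 3})])
first_reset = select_message(inbox, "Reset")
from_a2 = select_message(inbox, "Update", requestor="a2")
anything = select_message(inbox)
```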

4.3.3.5  CheckMessageQ

The CheckMessageQ primitive examines the current status of an incoming message queue without
blocking the calling process. The caller must specify a message queue type, either "request-queue"
or "reply-queue". The request-queue holds all of the request messages not yet accepted and
is allocated for each arobject instance. The reply-queue holds all of the reply messages not yet read
and is assigned to every process instance.

A message can be selected based on the sender's arobject id, operation name, and/or transaction
id. If more than one argument is given, only messages which satisfy all of the conditions will be
returned. If no corresponding message exists in a specified message queue, a "NULL-POINTER" will
be returned.


ptr-mds = CheckMessageQ(qtype, requestor, opr, trans-id)


MSG-DESCRIPTORS  *ptr-mds

Pointer to a list of the message descriptors selected by the specified selection
criteria.

MSG-Q  qtype  This  indicates  either  "request-"  or  "reply-"  message  queue. 

AID  requestor  The  aid  of  the  requesting  arobject. 

OPE-SELECTOR  opr 

The  operation  to  be  performed.  The  "opr"  parameter  can  be  a  specific  operation 
name  or  "ANYOPR". 

TRANSACTION-ID  req-trans-id 

The  transaction  id  of  the  corresponding  RequestSingle  or  RequestAII  primitive. 


On Error: The CheckMessageQ primitive fails if the specified message queue, operation, or the
transaction  id  does  not  exist.  If  the  CheckMessageQ  primitive  fails,  a  "NULL-POINTER"  will  be 
returned. 
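
The non-blocking, conjunctive selection of CheckMessageQ can be sketched as follows. The Descriptor tuple and the check_message_q function are hypothetical names, and None is assumed as the representation of "NULL-POINTER". Only the unknown-queue failure is modelled.

```python
from typing import NamedTuple


class Descriptor(NamedTuple):
    requestor: str
    opr: str
    trans_id: int


NULL_POINTER = None


def check_message_q(queues, qtype, requestor=None, opr=None, trans_id=None):
    """Return descriptors matching every given criterion, without blocking."""
    if qtype not in queues:
        return NULL_POINTER
    wanted = Descriptor(requestor, opr, trans_id)
    hits = [d for d in queues[qtype]
            if all(w is None or w == v for w, v in zip(wanted, d))]
    return hits or NULL_POINTER


d0 = Descriptor("a1", "Update", 11)
d1 = Descriptor("a2", "Update", 12)
d2 = Descriptor("a1", "Reset", 13)
queues = {"request-queue": [d0, d1, d2], "reply-queue": []}
from_a1 = check_message_q(queues, "request-queue", requestor="a1")
```

An omitted criterion matches any value, so supplying more arguments can only narrow the result, which matches the "satisfy all of the conditions" rule.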

4.3.3.6  Reply

After  processing  an  Accept  primitive,  a  process  must  use  a  Reply  primitive  to  send  the  completion 
message  to  the  requesting  process.  At  the  requestor’s  site,  the  completion  message  is  received  by 
the second half of the synchronous Request primitive or by the GetReply primitive. It should be noted
that the reply need not necessarily be sent from the same process which accepted the operation. In
other words, the requestor's transaction id is used to determine the proper reply message.


trans-id = Reply(req-trans-id, reply-msg)


TRANSACTION-ID  trans-id

The transaction id of this Reply primitive (... transaction id).


TRANSACTION-ID  req-trans-id

The transaction id of the request that has been serviced.


MESSAGE  *reply-msg

A pointer to the reply message.


On Error: The Reply primitive fails if the specified transaction has already been aborted. If the Reply
primitive fails, a "NULL-TID" will be returned.




4.3.4  Private  Object  Management 

A process can dynamically create a private object, which is an instance of a private abstract data
type  defined  in  its  arobject,  at  any  node.  If  a  private  abstract  data  type  has  instances  of  atomic  or 
permanent  data  objects,  the  actual  data  objects  will  be  allocated  in  non-volatile  memory. 

4.3.4.1  Allocate/Free Object

An  AllocateObject  primitive  allocates  an  instance  of  a  private  abstract  data  type  at  any  node  and  a 
FreeObject  primitive  deallocates  the  specified  instance. 


object-ptr = AllocateObject(object-type, parameters [, node-id])
val  =  FreeObject(object-ptr) 


OBJECT-PTR  object-ptr 

A  pointer  to  the  allocated  private  data  object. 

OBJECT-TYPE  object-type 

The  object-type  indicates  the  name  of  a  private  abstract  data  type. 

BOOLEAN  val  TRUE if the object was released successfully; otherwise FALSE.

NODE-ID  node-id  Node identification. An actual node may be designated, or a node selection
criterion may be designated (e.g., the current node, any node except the current
node, any node, or a specific node).


On Error: If the object-type is not defined, an AllocateObject primitive fails and returns a
"NULL-POINTER". If the object-ptr is not pointing to a proper permanent object, then the FreeObject
primitive fails and returns "FALSE".

4.3.4.2  FlushPermanent

A FlushPermanent primitive blocks the caller until the specified data object is saved in non-volatile
storage.


FlushPermanent(object-ptr, size)

OBJECT-PTR  object-ptr

A pointer to the permanent data object.


INT  size

The number of bytes which must be flushed into permanent storage.




On  Error:  The  FlushPermanent  primitive  fails  if  the  specified  object  does  not  exist. 

4.3.5  Synchronization 

ArchOS provides two levels of synchronization facilities. One, the critical region, is for controlling
concurrent access to a single shared object within an arobject, while the other, the lock, is for
controlling accesses from concurrent transactions.

The  critical  region  scheme  should  be  used  when  the  shared  object  does  not  need  to  provide  failure 
atomicity.  That  is,  a  client  may  be  able  to  access  an  inconsistent  state  of  the  object.  On  the  other 
hand,  if  these  objects  are  atomic  objects  and  are  accessed  from  transactions,  then  a  client  must  be 
allowed  to  see  only  consistent  objects.  In  the  critical  region  scheme,  mutual  exclusion  is  achieved  by 
implicit  locking  by  using  an  event  variable,  while  the  transactions  require  an  explicit  lock  on  the 
atomic  object.  The  CreateLock  and  DeleteLock  primitives  are  provided  to  create  and  remove  a  lock 
for  the  explicit  locking  scheme.  Actual  locks  can  be  set  by  using  a  SetLock  or  TestandSetLock 
primitive  and  can  be  released  by  using  a  ReleaseLock  primitive. 

4.3.5.1  Region

The Region construct provides a simple mutual exclusion mechanism for controlling concurrent
accesses to shared objects. A timeout value must be specified in order to bound the total execution
time in the critical region, including any time spent waiting to enter the region. If a timeout occurs, a
client process is forced to exit from the critical region and an error status is returned.

Region(ev, timeout){ . . . }

EVENT-VAR  ev  The event variable consists of a waiting queue of client processes and an event
counter.

TIMEOUT  timeout  The timeout value should indicate the maximum execution time for this critical
region including the waiting time.

On Timeout: A client should be able to check whether the critical region was exited due to a timeout
by checking on error information in its error block (see Section 4.3.11.5).
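
A rough Python analogue of a critical region with a bounded total time is sketched below. It is an approximation under stated assumptions: EventVar and region are hypothetical names, and Python cannot force a running body out of the region, so the sketch only bounds the waiting time and reports an overrun after the body returns.

```python
import threading
import time


class EventVar:
    """Hypothetical event variable: a lock (with its wait queue) plus a counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0   # event counter: number of entries into the region


def region(ev, timeout, body):
    """Run body(deadline) under mutual exclusion; return (ok, result).

    ok is False if the caller timed out while waiting to enter, or if the
    body finished after the deadline.
    """
    deadline = time.monotonic() + timeout
    if not ev._lock.acquire(timeout=timeout):
        return False, None                 # timed out while waiting to enter
    try:
        ev.count += 1
        result = body(deadline)            # body may poll the deadline itself
        return time.monotonic() <= deadline, result
    finally:
        ev._lock.release()


ev = EventVar()
ok, value = region(ev, 1.0, lambda deadline: 42)
```

The returned flag plays the role of the error status that a client would otherwise read from its error block.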




4.3.5.2  CreateLock and DeleteLock

The CreateLock primitive creates either a tree-type or discrete-type lock and returns a unique lock
id. Since a lock id is not bound to any data object explicitly, a client must be responsible for utilizing
the lock in accordance with the locking protocol. The DeleteLock primitive removes the specified
lock from the system.


newlock-id = CreateLock([parent-lockid])
val = DeleteLock(lockid)


LOCK-ID  newlock-id

A  new  lock  id  will  be  returned. 

BOOLEAN  val  TRUE if the lockid is removed successfully; otherwise FALSE.

LOCK-ID  parent-lockid

If the created lock must be a tree-type lock, then its parent-lockid must be
specified. If the parent-lockid is a "NULL-LOCK-ID", then the new lock will be the
root of a new lock tree. If a parent-lockid is not given, then the new lock will be a
discrete-type lock.

LOCK-ID  lockid  The  lockid  to  be  deleted.  If  the  lockid  is  a  tree-type  lock,  then  the  entire  subtree  of 
which  this  lock  is  the  root  will  be  deleted. 


On Error: If a parent lockid does not exist, then the CreateLock primitive fails and returns a
"NULL-LOCK-ID". A DeleteLock primitive fails if the specified lockid does not exist, and returns "FALSE".


4.3.5.3  SetLock, TestLock, and ReleaseLock

The SetLock primitive sets a "tree-type" or "discrete-type" lock on arbitrary objects by specifying a
lock key and its mode. If a requested lock is being held, the caller will block until it is released. The
TestandSetLock primitive also tries to set a lock; however, it will return a "FALSE" if the lock is being
held. If the requested lock is a tree-lock type, then the SetLock and TestandSetLock primitives may also
fail due to the violation of the tree-lock convention (see Section 4.2.4.2).


The TestLock primitive checks the availability of a specified lock with a lock mode. In the case of a
tree lock, it also checks whether the locking would be legal in the corresponding lock tree. The
ReleaseLock primitive can release the lock on an object which was gained by the SetLock or
TestandSetLock primitive explicitly.




sval = SetLock(lock-type, lockid, lock-mode)
sval = TestandSetLock(lock-type, lockid, lock-mode)
tval = TestLock(lock-type, lockid, lock-mode)
rval = ReleaseLock(lock-type, lockid, lock-mode)

INT  sval  1 if the specified lock is set; 0 if the lock is not set. A negative value will be
returned if an error occurred.

INT  tval  1 if the specified lock is being held; 0 if the lock is not being held. A negative
value will be returned if an error occurred.

INT  rval  1 if the specified lock is released; 0 if the lock is not released. A negative value
will be returned if an error occurred.

LOCK-TYPE  lock-type 

The lock type can be either "TREE" or "DISCRETE".

LOCK-ID  lockid  The  lockid  indicates  the  unique  id  of  a  lock. 

LOCK-MODE  lock-mode 

The  lock  mode  can  be  "READ",  "WRITE",  etc. 


On  Error:  An  error  may  occur  if  a  non-existent  lock  id  or  lock  mode  is  used  for  the  above  primitives. 
A detailed error condition, such as a non-existent lock or lock mode, is available by looking at the
caller's error block (see Section 4.3.11.5).

4.3.6  Transaction/Recovery Management

ArchOS allows a client process to create compound transactions and elementary transactions,
which can be nested in any combination. By using nested elementary transactions, a client can use a
traditional "nested transactions" mechanism. In addition to this, the compound transaction provides
a mechanism which can commit the transactions at the end of the current scope without delaying the
commit point to the end of the top-level transaction. Since a completed nested compound transaction
cannot be undone, ArchOS provides a mechanism to perform the corresponding compensate actions
automatically (see Section 4.2.4).

It should be noted that ArchOS cannot generate such a sequence of compensate actions
automatically. However, ArchOS provides a mechanism to execute the defined compensate actions
in the proper sequence to make the status of each affected atomic object a member of an
equivalence class of its correct "pre-execution" state (see Section 4.2.4.1).




4.3.6.1  Compound Transaction

A  compound  transaction  construct  creates  a  new  transaction  scope  in  a  client  process.  Within  this 
scope,  a  client  can  access  atomic  objects  as  if  these  computational  steps  were  executed  alone. 

When  a  compound  transaction  starts,  no  locks  will  be  inherited  from  its  parent  transaction  if  one 
exists. That is, all of its locks must be obtained within this scope by means of the SetLock primitives
(see Section 4.3.5.3). However, at the end of the compound transaction scope, ArchOS releases all
of  the  locks  for  this  transaction  automatically.  It  is  also  possible  to  release  locks  before  the  end  of  the 
transaction  scope  by  using  a  ReleaseLock  primitive  explicitly. 

If  a  compound  transaction  must  abort,  an  AbortTransaction  primitive  (see  Section  4.3.6.4)  performs 
the  necessary  compensate  actions,  breaks  the  current  transaction  scope,  and  passes  control  to  the 
end  of  the  transaction  scope. 


CT(timeout){  . . .  <transaction  steps>  . . . } 

TIME  timeout  The timeout value indicates the maximum lifetime of this compound transaction.
<transaction steps>  Atomic objects can only be accessed and altered within these transaction steps.


On  Timeout:  The  current  transaction  and  all  of  its  child  transactions  will  be  aborted.  That  is, 
ArchOS  will  execute  all  of  the  necessary  compensate  actions  and  undo.  After  completion  of  these 
actions, the status of atomic objects should be consistent and be a member of the
equivalence class of its pre-(transaction) execution state.

4.3.6.2  Elementary Transaction

An elementary transaction construct also creates a new transaction scope in a client process.
Within this scope, a client can access atomic objects as if these computational steps were executed
alone.

When an elementary transaction starts, it can obtain its ancestor transaction's locks if there is an
ancestor. That is, this elementary transaction may access atomic objects which were manipulated by
the higher level transactions. If the ancestor has modified atomic objects, these modifications will be
visible to this transaction. At the end of an elementary transaction scope, ArchOS will propagate all of
its locks to the parent transaction if one exists. If the elementary transaction is the topmost
transaction, then all of its locks will be released at this point and all of its atomic objects will be
committed.

If  an  elementary  transaction  must  abort,  an  AbortTransaction  primitive  (see  Section  4.3.6.4) 
performs  the  necessary  "undo"  actions,  breaks  the  current  transaction  scope,  and  passes  the 
current  control  to  the  end  of  the  transaction  scope. 
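
The lock inheritance, upward lock propagation, and "undo" behaviour described for elementary transactions can be sketched in Python. The ElementaryTransaction class and its methods are hypothetical, locks are modelled as plain names in a set, and atomic objects are modelled as dictionary entries with an undo log of previous values.

```python
_MISSING = object()   # marks "key did not exist before the write"


class ElementaryTransaction:
    def __init__(self, parent=None):
        self.parent = parent
        self.locks = set()
        self.undo_log = []          # (store, key, previous value) records

    def holds(self, lock):
        """True if this transaction or any ancestor holds the lock."""
        t = self
        while t is not None:
            if lock in t.locks:
                return True
            t = t.parent
        return False

    def write(self, store, key, value):
        self.undo_log.append((store, key, store.get(key, _MISSING)))
        store[key] = value

    def commit(self):
        if self.parent is not None:
            # Nested commit: locks and undo records move to the parent,
            # because the parent may still abort later.
            self.parent.locks |= self.locks
            self.parent.undo_log.extend(self.undo_log)
        # Topmost commit: locks are simply released and changes stay.
        self.locks = set()
        self.undo_log = []

    def abort(self):
        for store, key, old in reversed(self.undo_log):
            if old is _MISSING:
                del store[key]
            else:
                store[key] = old
        self.locks = set()
        self.undo_log = []


store = {"x": 1}
top = ElementaryTransaction()
top.locks.add("L1")
child = ElementaryTransaction(parent=top)
inherited = child.holds("L1")
child.locks.add("L2")
child.write(store, "x", 2)
child.write(store, "y", 5)
child.commit()
locks_after_child_commit = set(top.locks)
top.abort()
```

Aborting the topmost transaction after the child has committed restores the store to its initial contents, which corresponds to returning atomic objects to states identical to their pre-execution states.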


ET(timeout){  . . .  <transaction  steps> . . . } 

TIME  timeout  The timeout value indicates the maximum lifetime of this elementary transaction.
<transaction steps>  Atomic objects can only be accessed and altered within these transaction steps.


On Timeout: The current transaction and all of its child transactions will be aborted. That is,
ArchOS will execute all of the necessary undo and compensate actions automatically. After
completion of these actions, the status of all affected atomic objects should be consistent and be
"identical" to the correct initial (pre-execution) states.

4.3.6.3  SelfTid and ParentTid

The  SelfTid  primitive  returns  the  id  of  the  current  transaction  and  the  ParentTid  primitive  returns  the 
parent  transaction  id  of  the  given  transaction  id. 


mytid = SelfTid()
ptid  =  ParentTid(tid) 

TID  mytid  The  id  of  the  current  transaction. 

TID  tid  The id of the specific transaction.

TID  ptid  The parent's tid of the given transaction "tid".


On Error: If the ParentTid primitive fails, a "NULL-TID" will be returned.


4.3.6.4  AbortTransaction

The AbortTransaction primitive aborts the specified transaction and all of its child transactions
within the same transaction tree (see Figure 4-3 in Section 4.2.4.3). If the transaction that invokes the
AbortTransaction primitive does not belong to the same transaction tree as the transaction which is to
be aborted, a client cannot abort that transaction. This primitive executes all of the necessary "undo"
or "compensate" actions, based on the transaction type, and breaks the current transaction scope.
After completion of these actions, the status of all affected atomic objects will be consistent and
either "identical" to or "a member of the equivalence class" of their initial (pre-execution)
states.

The  AbortIncompleteTransaction  primitive  also  aborts  all  of  the  outstanding  incomplete 
transactions  which  had  been  initiated  by  an  outstanding  RequestSingle  or  RequestAll  primitive.  In 
other  words,  all  of  the  nested  transactions  which  belong  to  the  specified  request  transaction  but  have 
not  yet  completed  (committed)  will  be  aborted. 


val  =  AbortTransaction(tid) 

val = AbortIncompleteTransaction(req-tid)

BOOLEAN  val  TRUE  if  the  transaction  was  aborted  successfully;  otherwise  FALSE. 
TID  tid  The  id  of  the  transaction. 

TID  req-tid  The  transaction  id  of  the  RequestSingle  or  RequestAll  primitive. 


On Error: If an Abort primitive is called from a transaction which is not a parent of, or identical to, the
designated  transaction,  the  primitive  will  fail  and  return  FALSE. 

4.3.6.5  TransactionType

The TransactionType primitive returns the type of the given transaction (compound or elementary)
and also indicates the transaction level.


trantype = TransactionType(tid)

TRANTYPE  trantype

The type of the given transaction, such as "CT", "ET", "Nested CT", or "Nested ET".


TID  tid

The id of the transaction.




On  Error:  If  there  is  no  specified  transaction  in  its  transaction  tree,  the  TransactionType  primitive 
fails and returns "NULL-TRAN-TYPE".

4.3.6.6  IsCommitted

The IsCommitted primitive checks whether the given transaction is already committed or not.

val = IsCommitted(tid)

BOOLEAN  val  TRUE if the specified transaction was committed; otherwise FALSE.

TID  tid  The  id  of  the  transaction. 


On  Error:  If  there  is  no  specified  transaction  in  its  transaction  tree,  the  IsCommitted  primitive  fails 
and  returns  "FALSE". 

4.3.6.7  IsAborted

The  IsAborted  primitive  checks  whether  the  given  transaction  is  already  aborted  or  not. 

val  =  IsAborted(tid) 

BOOLEAN  val  TRUE  if  the  specified  transaction  was  aborted;  otherwise  FALSE. 

TID  tid  The  id  of  the  transaction. 


On Error: If there is no specified transaction in its transaction tree, the IsAborted primitive fails and
returns  "FALSE". 


4.3.7  File Management


Viewing a file as a set of long-term persistent data, it is clear that an arobject can also fulfill this role.
To create a file, an instance of the appropriate arobject can be created, and the filename can be
bound to it. To erase the file, the Kill primitive will serve. Reading from and writing to the file can be
performed using the appropriate operations of the arobject itself. Similarly, control functions (e.g.,
back..., random placement, search) become operations of the arobject. The data to be stored in
the file is merely contained in one of the arobject's private data objects. If this private data object is
declared to be atomic, then the file will be treated as an atomic file and all of the file accesses must be
performed from within a transaction scope.

4.3.7.1  File Access Interface

In  order  to  avoid  the  low  level  (bare)  access  to  a  file  arobject  (i.e.,  an  explicit  invocation  of  an 
operation  on  a  file  arobject),  ArchOS  provides  the  following  set  of  primitives  which  support 
conventional  file  access  and  control  functions. 


The OpenFile primitive opens a specified file with the given access mode, and the CloseFile
primitive closes the file. The ReadFile and WriteFile primitives provide a simple "byte-stream"
oriented read and write access to a file, respectively. The CreateFile primitive creates a file of a given
file type and the DeleteFile primitive deletes a file. The SeekFile primitive moves the current reading
or writing position to a specified byte location in the file.


fd = OpenFile(filename, mode)
val = CloseFile(fd)
nr = ReadFile(fd, buf, nbytes)
nw = WriteFile(fd, buf, nbytes)
val = CreateFile(filename, filetype)
val = DeleteFile(filename)
pos = SeekFile(fd, offset, origin)


FILEDESCRIPTOR  *fd

A  pointer  to  the  file  descriptor. 

BOOLEAN  val  TRUE if the specified operation is done successfully; otherwise FALSE.

INT  nr  The actual number of bytes which were read.

INT  nw  The actual number of bytes which were written.

FILENAME  filename

The name of the file.

ACCESSMODE  mode

The mode for accessing the file (e.g., "READ", "WRITE", "APPEND",
"READLOCK", "WRITELOCK", etc.).

FILE-TYPE  filetype  The type of the specified file such as "ATOMIC", "PERMANENT" or "NORMAL".

BUFFER  *buf  The buffer address.




INT  nbytes  The number of bytes to be read or written during a Read or Write operation,
respectively.

LONGINT  pos  The current pointer's position in the file in bytes. If the seek action is done
successfully, the value of pos must be a positive integer; otherwise -1.

FILEOFFSET  offset  The offset value from the given origin point, in bytes.

FILEORIGIN  origin  The origin of the seek operation (the beginning, the current position, or
the end of the file).


On Error: If the file does not exist or the file mode is not supported, then the OpenFile primitive fails
and returns "NULL-FILE-DESCRIPTOR". If the file descriptor is not an opened descriptor, then a
close or read/write action fails, "-1" will be returned, and a detailed error condition will also be set
in the caller's error block. If a create/delete action fails, then "FALSE" will be returned and a detailed
error condition will also be set in the caller's error block. If the specified file does not exist or the
value of offset or origin is not in the proper range, "-1" will be returned and a detailed error condition
will also be set in the caller's error block.
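
The offset and origin semantics of SeekFile resemble the seek interface of Python file objects, which makes a small sketch possible. The function seek_file and the origin names "BEGINNING", "CURRENT", and "END" are hypothetical; an in-memory byte stream stands in for a file arobject.

```python
import io
import os

# Assumed names for the three origins mentioned in the text.
ORIGINS = {"BEGINNING": os.SEEK_SET, "CURRENT": os.SEEK_CUR, "END": os.SEEK_END}


def seek_file(fd, offset, origin):
    """Move the read/write position; return the new position or -1 on error."""
    if origin not in ORIGINS:
        return -1
    try:
        return fd.seek(offset, ORIGINS[origin])
    except (ValueError, OSError):
        return -1


fd = io.BytesIO(b"0123456789")
pos_from_start = seek_file(fd, 4, "BEGINNING")
two_bytes = fd.read(2)            # reading advances the position to 6
pos_from_end = seek_file(fd, -3, "END")
pos_from_current = seek_file(fd, 1, "CURRENT")
```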

4.3.8  I/O  Device  Management 

A normal I/O device access protocol should be similar to the file access protocol described in
Section 4.3.7. To read, write or send a special command, the target device must be opened. Once all
actions are done, it must be closed. All device dependent commands can be sent to devices by
using the SetIOControl primitive.

At the lowest level, I/O control and data transfer will be handled according to the hardware
interface definition (e.g., memory read and write to memory mapped devices). The IOWait primitive
can be used to wait for an interrupt from a particular device; at most one process may wait for a
particular device interrupt at one time.

4.3.8.1  Basic I/O Device Access Interface

Typically, access to a device is performed by the following primitives. If a device is not readable or
writeable, then a special SetIOControl primitive (see Section 4.3.8.3) must be used.


dd = OpenDevice(device-name, mode)
val = CloseDevice(dd)
nr = ReadDevice(dd, buf, nbytes)
nw = WriteDevice(dd, buf, nbytes)




DEVDESCRIPTOR  *dd

A pointer to the device descriptor.

BOOLEAN  val  TRUE if the specified operation is done successfully; otherwise FALSE.

INT  nr  The actual number of bytes which were read.

INT  nw  The actual number of bytes which were written.

DEVNAME  device-name

The device name.

ACCESSMODE  mode

The access mode of the specified device such as "READ", "WRITE", etc.


BUFFER  *buf  The buffer address.

INT  nbytes  The number of bytes to be read or written during a ReadDevice or WriteDevice
operation, respectively.




On Timeout: If an IOWait primitive cannot complete within the specified timeout value, a negative
value  will  be  returned. 

4.3.8.3  SetIOControl

The SetIOControl primitive sends device control information to the specified device. It also receives
status information from the device.


val = SetIOControl(dev-descriptor, io-command, dev-buf, timeout)

BOOLEAN  val  TRUE if this SetIOControl succeeded; otherwise FALSE.

DEV-DESCRIPTOR  *dev-descriptor

A pointer to the device descriptor.

IO-COMMAND  io-command 

The  io-command  indicates  a  device-specific  control  command. 

DEV-BUF *dev-buf The dev-buf indicates a pointer to a buffer which will be filled with device status.

TIME timeout The timeout value indicates the maximum execution time of this I/O function.


On Error: If the device descriptor is not an opened descriptor, then the control action fails,
"FALSE" will be returned, and a detailed error condition will also be set in the caller's error block.

On Timeout: If a SetIOControl primitive cannot complete within the specified timeout value, a
negative value will be returned.

4.3.9 Time Management

In ArchOS, time management not only provides basic access to the system real time clock,
which is maintained in non-volatile storage, but also supports primitives which provide the basic
functions of time-driven (process) scheduling.

The GetRealTime primitive obtains the system's current real time clock value, while the
GetTimeDate primitive fetches the current absolute time and date. The Delay primitive delays the
caller's execution for the specified time period. The Alarm primitive also postpones the caller until the
specified time of day and date. The Delay and Alarm primitives provide aperiodic time dependent
processing, while periodic processing will normally be controlled using the Policy
primitives (see Section 4.3.10).




4.3.9.1 GetRealTime

The  GetRealTime  primitive  returns  the  current  real  time  clock  value  in  microseconds. 


rtc = GetRealTime()


REALTIME  rtc  Value  of  current  real-time  clock  in  microseconds.  This  value  may  be  used  to 
compute  time  values  for  use  in  delay  or  alarm  primitives. 


4.3.9.2 GetTimeDate

The GetTimeDate primitive returns the current absolute time and date. The time value accuracy will
be  limited  by  delays  in  the  calibration  entry  performed  by  the  application,  and  cannot  be  expected  to 
return  exactly  the  same  value  simultaneously  at  every  node.  ArchOS  will  maintain  this  information  on 
a  best  effort  basis. 


(time, date) = GetTimeDate()

TIME  time  Value  of  current  time  of  day  in  microseconds  from  midnight. 

DATE  date  Julian  date  of  this  day. 


On  Error:  If  the  SetTimeDate  primitive  operation  has  not  been  executed  since  the  system  was 
initialized,  the  date  value  returned  is  zero. 
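
Since the TIME value is expressed in microseconds from midnight, a client will usually need to break it into conventional fields. The following Python sketch shows the arithmetic; the function name split_time_of_day is hypothetical and is not an ArchOS primitive.

```python
US_PER_SECOND = 1_000_000


def split_time_of_day(time_us):
    """Split microseconds from midnight into (hours, minutes, seconds, micros)."""
    seconds, micros = divmod(time_us, US_PER_SECOND)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds, micros
```

For example, a TIME value of 3,723,000,004 corresponds to 01:02:03 plus 4 microseconds.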

4.3.9.3 Delay

The Delay primitive delays a specified length of time (in microseconds), then returns the current
date and time when the delay has been completed and execution has resumed. This primitive can
also be used to specify a deadline, by which time a deadline milestone must be reached, as well as an
estimate of the amount of processing time that will be required to reach the deadline milestone.
(ArchOS may also make estimates about this processing time based on observations of earlier
executions.) Finally, the client may use the Delay primitive to mark the arrival of the process at the
deadline milestone referred to as the current deadline (while optionally defining the next deadline
milestone).


(time, date) = Delay(delaytime, deadline, util, dlflag)

REALTIME  delaytime 

Time  to  be  delayed  starting  at  the  present  time,  in  microseconds.  No  delay  if  this 
value  is  not  positive;  in  this  case  it  replies  immediately. 

REALTIME  deadline 

Elapsed  time  in  microseconds  from  delaytime  by  which  the  deadline  milestone 
must  be  reached.  If  this  value  is  not  positive,  the  deadline  remains  unchanged. 

REALTIME  util  Estimated  execution  time  of  this  process  in  microseconds  which  will  be  required 
before  the  deadline  is  reached.  If  this  value  is  0  or  less,  the  current  processing 
time  estimate  remains  unchanged.  The  use  of  this  value  by  ArchOS  is  defined  by 
the  policy  mechanism  (see  Section  4.2.6). 

TIME  time  Current  time  of  day  in  microseconds  since  midnight. 

DATE  date  Julian  date  of  this  day. 

BOOLEAN dlflag Client sets to "TRUE" if this call is intended to mark the arrival of this process at
the  milestone  referred  to  as  the  current  deadline. 

On Error: If the SetTimeDate primitive operation has not been executed since the system was
initialized,  the  date  value  returned  is  zero. 
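
The parameter rules for Delay (a non-positive delaytime means no delay, and non-positive deadline or util values leave the current settings unchanged) can be sketched as follows. This is a model of the bookkeeping only and does not actually suspend the caller; DeadlineState and apply_delay are hypothetical names, and measuring the deadline from the moment the delay ends is our reading of "elapsed time from delaytime".

```python
from dataclasses import dataclass


@dataclass
class DeadlineState:
    deadline_at: int  # absolute deadline, in microseconds
    util: int         # processing time estimate, in microseconds


def apply_delay(state, now, delaytime, deadline, util):
    """Update state per the Delay parameter rules; return the resume time."""
    wait = delaytime if delaytime > 0 else 0   # no delay if not positive
    resume_at = now + wait
    if deadline > 0:                           # otherwise deadline unchanged
        state.deadline_at = resume_at + deadline
    if util > 0:                               # otherwise estimate unchanged
        state.util = util
    return resume_at
```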

4.3.9.4 Alarm

The Alarm primitive waits until the specified time of day and date, then replies with the current time
and date. If the specified time has already passed, it replies immediately. (The accuracy of the day
and date is as described in the GetTimeDate primitive above.) Like the Delay primitive, the Alarm
primitive can also be used to specify a deadline, by which time a deadline milestone must be reached,
as well as an estimate of the amount of processing time required to reach the deadline milestone.
(ArchOS may also make estimates about this processing time based on observations of earlier
executions.) Finally, the client may use the Alarm primitive to mark the arrival of the process at the
deadline milestone referred to as the current deadline (while optionally defining the next deadline
milestone).

(time, date) = Alarm(alarmtime, deadline, util, dlflag)

REALTIME alarmtime

Time and date at which processing of this process is requested to resume.




REALTIME  deadline 

Elapsed time in microseconds from alarmtime by which the deadline milestone
must  be  reached.  If  this  value  is  not  positive,  the  deadline  remains  unchanged. 

REALTIME  util  Estimated  execution  time  of  this  process  in  microseconds  which  will  be  required 
before  the  deadline  is  reached.  If  this  value  is  0,  the  current  processing  time 
estimate  remains  unchanged.  The  use  of  this  value  by  ArchOS  is  defined  by  the 
policy  mechanism  (see  Section  4.2.6). 

TIME  time  Current  time  of  day  in  microseconds  since  midnight. 

DATE  date  Julian  date  of  this  day. 

BOOLEAN  dlflag  Client  sets  to  "TRUE"  if  this  call  is  intended  to  mark  the  arrival  of  this  process  at 
the  milestone  referred  to  as  the  current  deadline. 

On  Error:  If  the  SetTimeDate  primitive  operation  has  not  been  executed  since  the  system  was 
initialized,  the  date  value  returned  is  zero. 

4.3.9.5 SetTimeDate

The SetTimeDate primitive sets the current absolute time and date. The date will be written into the
ArchOS data base and continually adjusted by ArchOS. ArchOS will maintain this information on a
best effort basis. This primitive should be used only once, when the application is initiated.

val = SetTimeDate(time, date)

BOOLEAN val TRUE if time and date were set; otherwise FALSE.

TIME  time  Value  of  current  time  of  day  in  microseconds  from  midnight. 

DATE  date  Julian  date  of  this  day. 

On Error: If the time and date were already set by this primitive, they remain unchanged, and val is
set to "FALSE".
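
The set-once behavior of SetTimeDate, together with the zero date returned by the time primitives before calibration, can be modeled with a small Python class. NodeClock is a hypothetical name for this sketch, and the model deliberately does not advance time the way ArchOS would.

```python
class NodeClock:
    """Static model of the time/date store (hypothetical; time does not advance)."""

    def __init__(self):
        self._is_set = False
        self.time_us = 0
        self.date = 0  # zero until SetTimeDate has been executed

    def set_time_date(self, time_us, date):
        """Set time and date once; later calls leave them unchanged."""
        if self._is_set:
            return False
        self.time_us, self.date, self._is_set = time_us, date, True
        return True

    def get_time_date(self):
        """Return (time, date); date is zero if never calibrated."""
        return self.time_us, self.date
```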




4.3.10  Policy  Management 

In ArchOS, policy management is carried out by a policy set arobject and a set of policy modules.
The policy set arobject must exist in a distributed program and maintain a directory of the policy
modules. The actual policy will be implemented as a separate policy module in ArchOS. A set of
policy attributes is also maintained by the policy set arobject and ArchOS and will be referred to by
the policy modules.

A SetPolicy primitive specifies a new policy module to carry out a designated policy in the
application's policy set arobject. If no SetPolicy primitive is invoked, ArchOS provides a default
policy module for the application program. A SetAttribute primitive sets an attribute to a new value.


val = SetPolicy(policy-name, policy-module)
val  =  SetAttribute(attr-name,  attr-value) 


BOOLEAN val TRUE if the specified policy was set properly; otherwise FALSE.

POLICY-NAME  policy-name 

The  name  of  the  policy  to  be  set.  (Well-known  name  to  ArchOS,  such  as 
SCHEDULE.) 

POLICY-MODULE  policy-module 

The  name  of  the  policy  module  which  contains  the  policy. 

ATTRIBUTE-NAME  attr-name 

The  name  of  attribute  to  be  set. 

ATTRIBUTE-VALUE attr-value

The  actual  value  for  the  attribute. 


On Error: If there is no policy name or policy body, the SetPolicy primitive fails and "FALSE" will
be returned. The SetAttribute primitive fails if the specified attribute is not defined or an inappropriate
value is assigned.
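
As a rough illustration of the directory kept by the policy set arobject, the Python sketch below models SetPolicy and SetAttribute with two dictionaries. The PolicySet class, the "default-schedule" module name, and the rule that an "inappropriate" value is one whose type differs from the attribute's current value are all assumptions made for the example.

```python
class PolicySet:
    """Toy model of a policy set arobject's directory (hypothetical)."""

    def __init__(self, defined_attributes):
        # A default module is present until SetPolicy replaces it.
        self.policies = {"SCHEDULE": "default-schedule"}
        self.attributes = dict(defined_attributes)

    def set_policy(self, policy_name, policy_module):
        """Bind a policy module to a policy name; fail if either is missing."""
        if not policy_name or not policy_module:
            return False
        self.policies[policy_name] = policy_module
        return True

    def set_attribute(self, attr_name, attr_value):
        """Set a defined attribute; fail if undefined or of the wrong type."""
        if attr_name not in self.attributes:
            return False
        if type(attr_value) is not type(self.attributes[attr_name]):
            return False  # assumed meaning of "inappropriate value"
        self.attributes[attr_name] = attr_value
        return True
```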


4.3.11 System Monitoring and Debugging Support

ArchOS provides system monitoring and debugging primitives in order to control the complexity of
application development in a distributed environment. A system monitoring facility can provide for
tracking the behavior of arbitrary arobjects or processes in the system. The communication activity
among arobjects can also be monitored by intercepting the respective message.




4.3.11.1  Freeze  and  Unfreeze 

A FreezeAllApplications primitive stops all activities of the caller's application, and a FreezeNode
primitive halts all of the client's activities in a specific node. To resume the client's application,
UnfreezeAllApplications or UnfreezeNode will be used.

A FreezeArobject primitive stops the execution of an arobject (i.e., all of its processes), and a
FreezeProcess primitive halts a specific process for inspection. The UnfreezeArobject and
UnfreezeProcess primitives resume a suspended arobject and process, respectively. While a process
is in a frozen state, many of the factors used for making scheduling decisions can be selectively
ignored. For instance, a timeout value will be ignored by specifying a proper flag in the Freeze
primitive.


val = FreezeAllApplications()

val = UnfreezeAllApplications()

val = FreezeNode(node-id)

val = UnfreezeNode(node-id)

val = FreezeArobject(arobj-id [, options])

val = UnfreezeArobject(arobj-id [, options])

val = FreezeProcess(pid [, options])

val = UnfreezeProcess(pid [, options])


BOOLEAN  val  TRUE  if  the  primitive  was  executed  properly;  otherwise  FALSE. 

NODE-ID node-id The node id indicates the actual node which will be stopped.

AID arobj-id The unique arobject id of an arobject instance.

PID  pid  The  process  id  of  the  target  process. 

FREEZE-OPT  options 

The options indicate various selectable flags such as a timeout freeze/unfreeze
flag.


On Error: An error condition, such as non-existent node id, arobject id or pid, will be noted, and
detailed information regarding the error status will be available by looking at the caller's error block.
If the target node, arobject, or process is already unfrozen, then the Unfreeze action will be ignored
and FALSE will be returned along with the appropriate error information in the error block.
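
The error rules above amount to a small state table per target. The Python sketch below models it for processes only; FreezeTable and its last_error field (standing in for the caller's error block) are hypothetical, and treating a Freeze of an already frozen target as a success is an assumption, since the text does not cover that case.

```python
class FreezeTable:
    """Tracks frozen/unfrozen state per process id (hypothetical model)."""

    def __init__(self, known_pids):
        self.frozen = {pid: False for pid in known_pids}
        self.last_error = None  # stands in for the caller's error block

    def freeze(self, pid):
        if pid not in self.frozen:
            self.last_error = "non-existent pid"
            return False
        self.frozen[pid] = True  # assumed: re-freezing is harmless
        return True

    def unfreeze(self, pid):
        if pid not in self.frozen:
            self.last_error = "non-existent pid"
            return False
        if not self.frozen[pid]:
            self.last_error = "already unfrozen"  # action is ignored
            return False
        self.frozen[pid] = False
        return True
```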




4.3.11.2 Fetch and Store Arobject and Process Status

A  Fetch  primitive  inspects  the  status  of  a  running  or  frozen  arobject  or  process  in  terms  of  a  set  of 
frozen  values  of  private  data  objects.  The  specific  state  of  the  arobject  or  process  will  be  selected  by 
a  data  object  id.  The  state  includes  not  only  the  status  of  private  variables,  but  also  includes  process 
control  information. 


fval = FetchArobjectStatus(arobj-id, dataobj-id, buffer, size)
sval = StoreArobjectStatus(arobj-id, dataobj-id, buffer, size)
fval = FetchProcessStatus(pid, dataobj-id, buffer, size)
sval = StoreProcessStatus(pid, dataobj-id, buffer, size)


INT  fval  The  actual  number  of  bytes  which  were  fetched. 

INT  sval  The  actual  number  of  bytes  which  were  stored. 

AID  arobj-id  The  unique  id  of  the  arobject  instance. 

PID  pid  The  process  id  of  the  target  process. 

DATAOBJ-ID  dataobj-id  The  dataobj-id  indicates  the  private  object  or  system  control  status  of  the 
target  arobject/process. 

BUFFER *buffer A pointer to the buffer area for storing the returned data object value.

INT  size  The  size  indicates  the  buffer  size  in  bytes. 


On Error: An error condition, such as non-existent arobject id or pid, will be noted, and detailed
information regarding the error status will be available by looking at the caller's error block. If the
fetched data object is larger than the specified buffer size, then the content will be truncated. For the
storing operations, the data object size must equal the specified size; otherwise the value will not be replaced.
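
The asymmetry between fetching (truncate to the buffer size) and storing (replace only on an exact size match) is easy to get wrong, so a short Python sketch may help. The dictionary of data objects, the function names, and the return value of 0 on failure are assumptions for illustration only.

```python
def fetch_status(objects, dataobj_id, buffer, size):
    """Copy a data object into buffer, truncating to size; return bytes fetched."""
    value = objects.get(dataobj_id)
    if value is None:
        return 0  # assumed failure value
    n = min(len(value), size, len(buffer))
    buffer[:n] = value[:n]
    return n


def store_status(objects, dataobj_id, buffer, size):
    """Replace a data object only if size matches exactly; return bytes stored."""
    value = objects.get(dataobj_id)
    if value is None or len(value) != size or len(buffer) < size:
        return 0  # value is left unchanged
    objects[dataobj_id] = bytes(buffer[:size])
    return size
```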

4.3.11.3 Kill Arbitrary Arobject/Process

A GlobalKill primitive can destroy an arbitrary arobject or process in the system.


nproc = GlobalKillArobject(arobj-id [, options])
val = GlobalKillProcess(pid)

INT nproc The actual number of killed processes.




BOOLEAN val TRUE if the specified process was killed; otherwise FALSE.

AID  arobj-id  The  unique  id  of  the  arobject  instance. 

PID  pid  The  process  id  of  the  target  process. 

GKILL-OP  options  The  options  indicate  various  control  options.  For  example,  it  can  indicate 
whether  the  caller  stops  every  time  after  killing  a  single  process  or  not. 


On  Error:  An  error  condition,  such  as  non-existent  arobject  id  or  pid,  will  be  noted,  and  detailed 
information  regarding  the  error  status  will  be  available  by  looking  at  the  caller's  error  block.  If  the 
target  arobject  or  process  was  already  killed,  then  no  action  will  be  performed  and  "FALSE"  will  be 
returned. 

4.3.11.4 Monitor Message Communication Activities for Arobject/Process

The  CaptureComm  primitives  capture  on-going  communication  messages  from  the  specified 
arobject  or  process.  A  CaptureCommArobject  primitive  captures  all  of  the  incoming  request  and 
outgoing  reply  messages  to  a  specified  arobject  and  can  select  a  target  message  based  on  the  name 
of the operation. A CaptureCommProcess primitive captures all of the incoming messages and
outgoing  reply  messages  for 

The  WatchComm  primitives  are  similar  to  CaptureComm  primitives  except  that  all  of  the  monitored 
messages  are  duplicated,  not  captured. 


val = CaptureCommArobject(arobj-id, commtype, requestor, opr)
val = CaptureCommProcess(pid, opr)

val = WatchCommArobject(arobj-id, commtype, requestor, opr)
val = WatchCommProcess(pid, opr)


BOOLEAN val TRUE if the monitoring action was initiated successfully; otherwise FALSE.

AID arobj-id The unique id of the arobject instance.

PID pid The process id of the target process.

MSG-Q commtype This indicates either "REQUEST" or "REPLY" type.

AID requestor The aid of the communicating arobject.


OPR-SELECTOR opr

The operation to be performed. The "opr" parameter can be a specific operation
name or "ANYOPR".


On  Error:  An  error  condition,  such  as  non-existent  arobject  id  or  pid,  will  be  noted,  and  detailed 
information  regarding  the  error  status  will  be  available  by  looking  at  the  caller's  error  block. 

4.3.11.5 SetErrorBlock

A  SetErrorBlock  primitive  sets  an  error  block  in  a  process’s  address  space.  A  user  error  block 
consists  of  a  head  pointer  and  a  circular  queue.  The  head  pointer  contains  a  pointer  to  an  entry 
which  contains  the  latest  error  information  in  the  circular  queue.  After  the  execution  of  this  primitive, 
a  client  can  access  the  detailed  error  information  from  the  specified  error  block. 

It  should  be  noted  that  the  SetErrorBlock  primitive  will  be  executed  at  the  process  creation  time  (in 
the  library  routine),  so  that  the  system  default  error  block  will  be  set  automatically. 


val = SetErrorBlock(errblock, blocksize)

BOOLEAN  val  TRUE  if  the  error  block  is  set  successfully;  otherwise  FALSE. 

ERROR-BLOCK *errblock

The  address  of  error  block. 

INT  blocksize  The  size  of  error  block  in  bytes. 

On  Error:  If  the  address  of  the  error  block  is  not  valid,  then  the  primitive  fails  and  returns  FALSE. 
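
The error block layout described above, a head pointer plus a circular queue whose head designates the latest entry, can be sketched in Python as follows. The ErrorBlock class, its record and latest methods, and the use of -1 to mark an empty block are assumptions for the sketch; the source does not define the entry format.

```python
class ErrorBlock:
    """Circular queue of error entries with a head index (hypothetical layout)."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.head = -1  # index of the latest entry; -1 while empty

    def record(self, info):
        """Advance the head and overwrite the oldest slot with new error info."""
        self.head = (self.head + 1) % len(self.entries)
        self.entries[self.head] = info

    def latest(self):
        """Return the most recent error info, or None if nothing was recorded."""
        return None if self.head < 0 else self.entries[self.head]
```

Because the queue is circular, a client that wants older errors can walk backwards from the head, but only as far as the block size allows before entries have been overwritten.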

4.4 Rationale for the ArchOS Client Interface

This chapter is organized in parallel with the organization of the first chapters of this document. A
rational approach to reading this chapter would be to remove this chapter from the document, placing
it side by side with the remaining sections of the document, and then reading it in parallel with the
points made in the remaining sections.




4.4.1  Introduction 

The  preceding  sections  describe  in  some  detail  the  specifications  for  the  ArchOS  client  interface. 
Most  specification  documents  would  end  here  having  as  completely  as  possible  specified  the 
operations  and  the  expected  responses  of  the  operating  system.  This  specification,  and  others  like  it, 
however,  embody  the  results  of  a  large  number  of  decisions.  This  chapter  is  designed  to  describe  the 
rationale  for  the  decisions  made.  Obviously  not  every  decision  can  be  completely  described  here.  It's 
entirely  possible  that  there  will  be  a  number  of  important  decisions  which  we  will  not  describe,  but  our 
attempt  is  to  describe  all  those  trade-offs  that  we  have  consciously  made  with  respect  to  the  overall 
functions  involved.  We  will  try  to  identify  alternative  configurations  that  we  had  discussed  and  will  try 
to  identify  the  reasons  for  the  particular  decisions  made. 

4.4.2  ArchOS  Computational  Model 

The rationale for this section must perhaps be the most incomplete, since there are, of course, a
large number of choices one can make for the computational model of a distributed system.
Distributed  systems  have  been  built  using  models  varying  from  systems  built  on  a  star  configuration 
where  one  node  is  completely  in  charge  of  the  system  and  the  others  operate  in  a  slave  relationship, 
to  autonomous  systems;  networks  of  multiple  processors  tied  together  and  communicating  to  solve 
either a common problem or a large set of disjoint problems. The purpose of ArchOS is to build, as
we  have  stated,  a  distributed  computer  (i.e.,  a  set  of  processing  nodes  which,  operating  together,  act 
as  a  single  functional  entity  to  solve  a  particular  application  problem).  Thus  we  needed  a 
computational  model  that  would  reflect  the  unity  of  purpose  inherent  in  such  a  concept. 

In  addition,  we  were  concerned  with  the  software  engineering  aspects  of  the  application  design.  In 
many  existing  real  time  systems  (uniprocessors  as  well  as  multiple  processor  systems)  we  find  that 
there  is  a  strong  tendency  to  design  the  application  software  along  the  lines  of  the  operating  system 
interface, thus causing application partitioning to occur in the program at points that do not
correspond to the application problem itself. This effect is perhaps most clearly illustrated with a
standard Navy real time operating system in which each event must be handled by a user process
specifically designed to handle the event. Thus a single process may not handle both time and I/O
events, and sequential I/O operations must be performed by separate process invocations, resulting
in a very disjoint (and non-modular) program structure. This creates a problem with the reliability and
maintainability of the application system. Although ArchOS is not a production system, we expect
that eventually a production system will be built along the lines explored by this Archons research and
therefore we would like to start with a computational model which will lend itself to good software
engineering practices. In today's technology we felt this included first of all the need to define the
application as instantiations of abstract data types, but here we wanted to ensure that the abstract
data types we produced could exploit our distributed environment.

It  is  also  true  that  in  existing  real  time  systems,  and  particularly  in  command  and  control  systems 
such  as  we  are  considering  for  our  target  applications,  the  programs  are  frequently  very  large  and  are 
constructed  by  large  teams  of  programmers.  We  would  like  to  have  a  computational  model  which 
would  allow  not  only  good  software  engineering  practices  with  respect  to  small  programs  (i.e., 
programming  in  the  small),  but  also  with  respect  to  the  problems  of  building  software  in  the  large. 
Hence,  we  have  chosen  a  large  primary  entity  for  our  computational  model,  the  arobject.  The  size  of 
the  arobject  and  its  clean  interface  lends  itself  to  a  reasonable  organization  of  software  engineers  for 
development purposes. In addition, the arobject to arobject interface is sufficiently simple to
allow  a  software  engineering  organizational  break-out  along  arobject  lines.  This  is  intended  to 
simplify  not  only  the  software  development  of  a  highly  modular  application,  but  also  the  test  and 
evaluation  phase  of  such  a  large  system. 

4.4.2.1 Principal Components

Clearly,  our  choice  for  the  top  level  principal  component  (i.e.,  the  distributed  program),  is  a  natural 
one  in  light  of  the  fact  that  the  Archons  project  is  producing  a  distributed  computer  and  an  operating 
system  for  that  computer.  We  see  the  distributed  program  as  being  a  single  entity  with  respect  to  its 
overall  function,  although  it  is  made  up  of  a  number  of  much  smaller  components.  We  have  chosen 
not  to  be  concerned  with  the  execution  of  more  than  one  distributed  program,  although  we  have  not 
prohibited it since more than one distributed program may need to be present during application
testing.  It  may  be  argued  that  we  are  merely  playing  with  words  here,  and  in  a  sense  we  are,  but  the 
only  distinction  between  a  single  distributed  program  and  more  than  one  is  that  resource  allocation 
priorities between arobjects in the distributed program are defined, but resource allocation priorities
between arobjects in different distributed programs cannot be determined. So it is possible to take
two programs which are disjoint and merge them together by defining these interrelationships.
Clearly, although we do plan to be able to operate ArchOS in an environment with more than one
program running, we cannot claim that resource management decisions will be made fairly between
the two systems.

As we have stated, the arobject is a distributed abstract data type. We have taken from the Ada
model the concept of a separated specification and body. Similarly, the specification section is the
only portion which is visible in the computational sense from one arobject to another. We see the
arobject as a distributed abstract data type which can be instantiated more than once in the
distributed system.




We  should  note  that  an  instance  need  not  be  resident  on  a  single  node.  We  felt  that  tying  the 
instance  of  an  arobject  to  a  given  node  would  render  inflexible  the  potential  use  of  an  arobject  to 
handle  a  distributed  abstract  data  type.  The  decision  to  limit  objects  to  a  single  node  has  been  made 
in the design of a number of systems, such as the Eden system [Almes 83], in which an Eject
(conceptually  somewhat  similar  to  an  arobject)  must  be  entirely  resident  on  a  single  node.  Obviously, 
the  decision  to  be  resident  on  a  single  node  has  the  advantage  that  a  common  address  space  can  be 
used  for  the  entire  object.  We  felt  that  not  requiring  the  entire  arobject  address  space  be  contained 
on  a  single  node  would  give  us  great  flexibility  in  application  design,  so  we  deliberately  chose  not  to 
require  such  an  organization. 

By allowing more than one instance of a given arobject to be bound to a single reference
name,  we  have  made  it  possible  for  a  higher  level  arobject  to  completely  handle  a  distributed  abstract 
data  type.  One  could  envision,  for  example,  a  partially  replicated  directory  existing  on  a  number  of 
nodes  or  possibly  the  entire  set  of  nodes  in  a  distributed  system,  but  being  handled  by  a  single 
arobject  distributed  over  that  network.  The  specification  portion  identifies  the  entire  interface  to  such 
a group of arobjects by external arobjects, but processes within the arobject itself can communicate
with  each  other  across  the  various  nodes  regardless  of  the  node  boundaries.  Obviously,  the 
performance  of  such  a  system  must  be  taken  into  account  and  the  arobject  will  have  options  at  its 
instantiation  time  with  respect  to  ensuring  that  particular  subcomponents  (processes,  private  abstract 
data  type  instances,  and  so  on)  are  resident  on  common  nodes  or  separate  nodes  as  dictated  by  its 
requirements. 

The ArchOS file system illustrates some of the arobject lifetime characteristics. We have
chosen at this point to make very simple our concept of the file system by simply using permanent
instances of file arobjects [Almes 83]. Such arobjects can be instantiated or killed as required and the
set of their permanent arobject id's then becomes, in effect, a directory of the files being kept on the
system. The operations of these arobjects would provide the primitives required to access, modify,
update and delete the data within the arobject. We would envision, however, that not all arobjects
would be permanently instantiated. In fact, quite a number of them might be instantiated during the
system start-up operation and would die automatically when the program is terminated for any
reason, such as by powering off the system. Thus, an arobject instance might never have a
permanent form in existence.


The arobject body is intended to implement the operations described in the arobject specification.
It should be noted that the private abstract data types comprise the only form of shared data between
processes within an arobject. These are classical abstract data types which contain the data itself
and which can be manipulated only via the defined procedures in the abstract data type.
Encapsulating  atomic  or  permanent  data  within  a  private  abstract  data  type  allows  controlled  access 
to  shared  data.  ArchOS  will  require  that  any  procedures  which  access  atomic  data  must  have  a 
transaction  open  at  the  time  of  access.  Thus,  ArchOS  can  control  the  sharing  of  this  data,  and  when 
the  transaction  is  committed  or  aborted,  ArchOS  will  ensure  that  the  atomic  data  is  correctly  forced  to 
appropriate  stable  storage.  We  also  note  that  because  of  our  arobject  distribution  provisions,  the 
private  data  can  be  partitioned  among  multiple  nodes,  although  each  individual  private  abstract  data 
type  instantiation  must  be  fully  resident  on  a  single  node.  In  this  way  remote  procedure  call 
semantics,  including  parameter  passing  by  value,  will  be  used  if  the  process  is  not  co-located  with  the 
private  procedure  on  the  same  node.  It  is  intended  that  defining  the  private  abstract  data  types  in  this 
way  will  allow  us  to  directly  implement  such  distributed  applications  as  a  partially  replicated  directory 
and  in  fact,  an  example  of  such  an  implementation  can  be  found  in  Appendix  A  of  this  document. 

At  this  point,  let  us  observe  that  there  is  no  required  correlation  between  the  operations  of  an 
arobject  and  the  processes  defined  in  an  arobject  body.  It’s  anticipated  that  one  or  more  processes 
in  each  arobject  body  will  run  continuously,  pausing  from  time  to  time  to  check  for  the  existence  of  an 
operation  request  from  another  arobject.  If  such  a  request  is  not  found,  that  process  could  block 
awaiting another such request. Thus, the operation can be handled by any process within the
arobject,  which  allows  us  to  have  a  set  of  processes  with  essentially  identical  function,  handling 
operations  in  any  manner  that  the  application  desires.  The  binding  between  these  operations,  then,  is 
late  and  is  handled  dynamically  by  the  appropriate  processes.  If,  of  course,  the  designer  wants  to 
couple  the  processes  to  the  operations  he  need  only  specify  his  Accept  parameters  to  identify  which 
operations he wishes to accept at each point and therefore he can bind them as early as he wishes. A
cost  of  this  approach  is  that  the  potential  implementation  error  of  omitting  the  handling  of  some 
operation in an arobject cannot be detected at system generation time, but we feel that the benefit in
terms of modularity and concurrency greatly outweighs this problem.
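
The difference between late binding (any process takes whatever operation arrives) and early binding (a process names the operations it will accept) can be illustrated with a small Python sketch of a request queue. The accept function below is a hypothetical stand-in for the Accept primitive, whose actual parameters are defined elsewhere in this document; here a missing selector simply means "any operation", and None is returned where a real process would block.

```python
from collections import deque


def accept(queue, wanted=None):
    """Remove and return the first (operation, args) request that matches.

    wanted=None accepts any operation (late binding); a set of operation
    names restricts the process to those operations (early binding).
    Returns None where a real process would block awaiting a request.
    """
    for i, (opr, args) in enumerate(queue):
        if wanted is None or opr in wanted:
            del queue[i]
            return opr, args
    return None
```

Note the cost mentioned in the text: if no process ever names a given operation in its selector, requests for it simply stay in the queue, and nothing in this scheme detects that omission ahead of time.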

Processes within an arobject, of course, may communicate by using the CreateProcess primitive to
create processes and passing parameters at that time, or they may invoke operations within the
arobject using a Request primitive, using either operations from the specification part, or private
operations from the body. These private operations are designed to make it possible to place
operation requests into the incoming queue from internal processing which are separable from those
placed by external arobjects. Obviously processes within an arobject can also communicate via
shared data, using the private abstract data types. Any scheduling or mutual exclusion which must be
handled with respect to this shared data would be handled within the abstract data types.


In  addition,  we  have  allowed  for  inclusion  of  private  arobjects  to  be  defined  within  the  body  of  an 
arobject.  Such  a  private  arobject  will  not  be  visible  to  external  arobjects,  and  could  be  used  for 
partitioning  operations  within  an  abstract  data  type.  This  technique  could  be  used  for  an 
implementation  of  a  partially  replicated  directory  by  placing  each  partition  in  its  own  arobject  and 
using  the  outer  arobject  to  distribute  the  lookup  and  scheduling  for  the  arobjects  containing  the  data. 
(See  Appendix  A  Solution  2.) 

One  may,  of  course,  question  our  contention  that  these  processes  are  lightweight  as  we  have 
defined  them;  that  their  scheduling  overhead  will  be  small  relative  to  the  speed  of  the  machine.  This 
is  certainly  our  intention,  but  we  will  have  to  weigh  this  against  the  implementation  requirements  to 
obtain  the  interfaces  we  need.  It  is  our  intention,  however,  to  minimize  the  state  required  to  be 
constructed  in  the  scheduling  of  a  process.  In  addition,  we  envision  eventually  building  special 
purpose hardware for hosting ArchOS, one objective of which will be to optimize the creation of
processes  and  the  resulting  context  swap  overhead. 

4.4.2.2 Communication Facilities

4.4.2.2.1 Rationale for Accept/Request Rendezvous Mechanism

The  Request/Accept/Reply  primitives  were  selected  to  provide  the  means  of  communication 
among  arobjects.  These  primitives  allow  communicating  arobjects  to  rendezvous,  rather  than 
providing  a  master/slave  relationship  for  all  communications. 

The  master/slave  paradigm  was  rejected  because  it  seemed  too  restrictive.  Consider  that  a  major 
goal  of  ArchOS  is  to  support  decentralized  resource  management  by  collections  of  resource 
managers,  which  negotiate  to  reach  a  consensus  regarding  a  particular  management  decision.  The 
communications  involved  in  carrying  out  negotiations  among  resource  managers  are  not 
master/slave  in  nature;  they  are  better  characterized  as  communications  among  peers. 

The Request/Accept rendezvous captured this sense of peer communication--in order for
communication to take place, both parties involved must explicitly act; the requestor cannot force a
process to perform an Accept for a particular invocation.

Both blocking and non-blocking request primitives are provided to give the client a very flexible
communication  facility. 
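
A minimal sketch of the rendezvous, in Python and under stated assumptions: the class `Server` and the functions `request`, `request_async`, `accept`, and `serve_one` are hypothetical names, and a `Future` object stands in for whatever reply handle ArchOS actually provides. The point illustrated is that a request has no effect until the serving process itself chooses to accept it, and that the requestor may either wait for the reply or continue and collect it later.

```python
import queue
import threading
from concurrent.futures import Future


class Server:
    """Hypothetical arobject that is served by explicitly accepting requests."""

    def __init__(self):
        self.inbox = queue.Queue()

    def request_async(self, operation, arg):
        """Non-blocking request: return at once with a handle for the reply."""
        reply = Future()
        self.inbox.put((operation, arg, reply))
        return reply

    def request(self, operation, arg, timeout=None):
        """Blocking request: wait until the server has accepted and replied."""
        return self.request_async(operation, arg).result(timeout)

    def accept(self):
        """Called by the serving process when it decides to take a request."""
        return self.inbox.get()


def serve_one(server):
    """Accept a single request and reply to it."""
    operation, arg, reply = server.accept()
    reply.set_result((operation, arg * 2))   # the "Reply" step


server = Server()

# Non-blocking: the requestor continues; nothing happens until an Accept.
pending = server.request_async("double", 21)
print(pending.done())                        # False: no process has accepted yet
worker = threading.Thread(target=serve_one, args=(server,))
worker.start()
worker.join()
print(pending.result())                      # ('double', 42)

# Blocking: the requestor waits for the rendezvous to complete.
worker = threading.Thread(target=serve_one, args=(server,))
worker.start()
print(server.request("double", 5, timeout=5))
worker.join()
```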


4.4.2.2.2 Rationale for Broadcast Request Capability (RequestAll Primitive)

The RequestAll primitive provides the capability to broadcast a request to multiple arobjects. This
feature of the inter-arobject communication facility was very directly shaped by the anticipated
structure  of  ArchOS  internal  distributed  data  entities. 

It  is  assumed  that  the  ArchOS  operating  system  will  make  use  of  various  data  objects  that  must  be 
highly  reliable  and  available.  Such  objects  can  be  constructed  by  means  of  data  replication  and 
distribution  throughout  the  operating  system,  with  appropriate  use  of  the  ArchOS  transaction 
facilities.  In  addition,  ArchOS  will  almost  certainly  contain  multiple  instances  of  various  servers  (for 
instance,  file  servers  and  name  servers).  If  data  is  either  partially  or  completely  replicated  at  multiple 
locations  or  if  replicated  servers  are  available  at  multiple  locations,  it  seems  natural  to  use  broadcasts 
as  a  common  mode  of  communication  involving  these  replicated  entities.  For  example,  a  new  entry  in 
a  name  table  might  be  broadcast  to  all  of  the  arobjects  that  contain  a  portion  of  that  name  table.  Each 
partially  redundant  table  fragment  could  then  be  updated  appropriately. 
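
The name table example can be sketched as follows. This is an illustration in Python, not the ArchOS interface: `NameTableFragment`, `handle_add`, and `request_all` are hypothetical names, and partitioning the table by the first letter of the name is an assumption made only to keep the example small. Every fragment receives the same request and decides locally whether the entry belongs to it, so an entry held redundantly by two fragments is updated in both.

```python
class NameTableFragment:
    """Hypothetical arobject holding one portion of a partially replicated table."""

    def __init__(self, owned_prefixes):
        self.owned_prefixes = tuple(owned_prefixes)
        self.entries = {}

    def handle_add(self, name, value):
        """Apply the update if this fragment holds the relevant portion."""
        if name.startswith(self.owned_prefixes):
            self.entries[name] = value
            return True
        return False


def request_all(arobjects, name, value):
    """Deliver the same request to every arobject and collect the replies."""
    return [fragment.handle_add(name, value) for fragment in arobjects]


# Three fragments with overlapping (partially redundant) coverage.
fragments = [
    NameTableFragment(["a", "b"]),
    NameTableFragment(["b", "c"]),
    NameTableFragment(["c", "d"]),
]
replies = request_all(fragments, "bravo", 7)
print(replies)   # [True, True, False]
```

In ArchOS such an update would presumably be performed under the transaction facilities so that all copies change as a unit; the sketch omits that.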

4.4.2.2.3 Rationale for Intra-Arobject Request Capability

Despite  the  number  of  communication  primitives  provided  by  ArchOS,  we  wanted  to  keep  the  model 
of  communication  among  entities  as  uniform  as  possible.  As  a  consequence,  ArchOS  does  not 
support a direct process-to-process communication capability. All processes communicate only with
arobjects, by means of the Request/Accept mechanism, even when a process wishes to
communicate with another process in the same arobject. (This restriction is a consequence of the
fact that we do not want the user of an arobject to know about the implementation of another arobject
(including the number of processes and the process ids in that other arobject). The arobject
operation invoker can only see the specification, not the implementation, of the arobject to be
invoked. Therefore, process-to-process communication between two distinct arobject instances is
not possible. And in the interests of uniformity and simplicity, we also decided not to allow such
communications to take place within a single arobject instance.)

In order to allow processes in a single arobject to perform additional operations, beyond those
that are visible to other arobjects, it was necessary to provide an additional set of operations that can
only be invoked by the processes in that arobject. These operations are called the private operations
of the arobject, and they are invoked in exactly the same manner as operations on other arobjects.


4.4.2.2.4 Rationale for Invocation Parameter Passing

ArchOS Request parameters are passed to the receiving arobject using call-by-value semantics.
This  is  the  only  reasonable  way  to  pass  parameters  since  arobjects  do  not  share  any  address  space. 

The  only  exception  to  this  rule  comes  in  the  case  of  an  invocation  by  a  process  of  a  private 
operation.  In  that  case,  the  processes  can  have  intersecting  address  spaces  due  to  the  presence  of 
shared  private  data  (in  private  abstract  data  type  instances)  in  the  arobject.  As  a  result,  it  is  possible 
for  such  Request  and  Reply  messages  to  contain  references  to  common  objects  (for  example,  the 
names  of  private  data  type  instances). 
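
The difference between the two cases can be shown with a short Python sketch. The function names `request`, `private_request`, and `append_marker` are hypothetical, and a deep copy is used here only to imitate the effect of passing parameters between disjoint address spaces; how ArchOS actually transfers parameter values is not described in this section.

```python
import copy


def request(handler, params):
    """Inter-arobject request: the receiver works on its own copy (call-by-value)."""
    return handler(copy.deepcopy(params))


def private_request(handler, params):
    """Private operation within one arobject: shared private data may be referenced."""
    return handler(params)


def append_marker(items):
    """An operation that modifies the parameter it receives."""
    items.append("seen")
    return len(items)


data = ["a"]
request(append_marker, data)
print(data)           # ['a']: the caller's data is untouched

private_request(append_marker, data)
print(data)           # ['a', 'seen']: both processes refer to the same object
```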

4.4.2.3 System Load and Initialization

The  tradeoffs  involved  in  this  section  are  fairly  simple.  We  expect  that  the  system  would  be  loaded 
in  a  conventional  manner  from  external  storage  (e.g.  disk)  associated  with  some  of  the  nodes  in  the 
system,  and  we  expect  that  nodes  not  containing  local  external  storage  would  obtain  load  data  from 
neighboring  nodes.  We  expect  this  to  be  an  internal  ArchOS  design  decision,  which  will  therefore  be 
specified  at  a  later  time. 

The  question  to  be  answered  in  this  section  is  how  the  application  program  would  be  initialized 
once ArchOS is fully initialized and has determined its own status. We have taken the position at this
point  that  an  application  program  will  do  this  by  the  definition  of  a  unique  application  arobject  (the 
Root  arobject),  which  ArchOS  will  expect  to  find  on  one  or  more  nodes.  This  arobject  will  then 
determine its own status, including finding out if other copies of itself exist (or ensuring its own
uniqueness  if  needed).  This  arobject  will  then  Create  the  other  arobjects  needed  to  bring  up  the 
application  program. 

This is a simple approach which has been planned to maximize application program flexibility at
initialization time, and to eliminate the need for operator intervention other than that required by the
application itself.

4.4.2.4 Transactions

ArchOS is a highly decentralized, real time operating system designed to support highly
decentralized, real time applications. In fact, the operating system itself can be viewed as a highly
decentralized, real time application built on top of the kernel support facilities. We have attempted to
view ArchOS in this way at various times, and that has led us to view the arobject as a computational
entity that we would use within ArchOS (insofar as possible), as well as at the application level. While
specifying the services to be provided by ArchOS and its behavior under all conditions, it became
apparent that ArchOS' internal system data objects should possess several characteristics. In
particular, the following characteristics were desired:

•  Often,  several  different  arobjects  will  operate  on  specific  data  items  (directories,  queues, 
and  so  on).  These  shared  data  objects  will  reside  in  an  arobject,  so  access  can  be 
coordinated  by  the  normal  arobject  communication  facilities  (particularly  Accept 
primitives),  as  well  as  by  means  of  customized  code  in  the  arobject’s  processes.  But,  as 
the  following  points  will  illustrate,  a  higher  level  coordination  will  often  be  desirable. 

•  At  times,  it  is  necessary  to  change  several  different,  yet  related,  data  items  as  a  unit  (for 
example,  updating  all  of  the  copies  of  an  entry  in  a  partially  replicated  directory  or  moving 
an  element  from  one  queue  to  another).  If  such  atomic  updates  could  be  performed,  then 
it  would  be  much  easier  to  transform  one  consistent  state  of  a  set  of  data  items  to  another 
consistent  state. 

•  Some  data  items  must  be  permanent;  that  is,  their  state  should  be  reliably  maintained  for 
the  life  of  the  system. 

These  attributes  could  all  be  provided  by  ArchOS  by  means  of  the  communication  facilities  in 
conjunction with custom-written code and appropriately defined locks or semaphores and critical
regions.  However,  we  desired  a  more  structured  approach.  All  of  the  above  capabilities  are 
supported  by  traditional  database  transaction  systems  [Gray  77].  In  fact,  such  transaction  systems 
can  provide  even  more  powerful  properties  (specifically,  failure  atomicity  and/or  serializability).  It 
was  felt  that  this  additional  structure  would  ease  the  programmer’s  burden,  while  also  decreasing  the 
chances  for  programming  errors,  by  making  modular  programming  more  natural.  (Indeed,  the  use  of 
compound  transactions  promotes  modular  construction  of  programs  by  causing  the  transaction 
author  to  think  in  terms  of  consistency  preserving  transformations  on  sets  of  atomic  data  objects, 
thereby  causing  the  program's  data  to  be  partitioned  into  a  number  of  modular  atomic  data  sets. 
Also,  transaction  systems  in  general  can  aid  the  programmer  in  another  important  way:  the 
processing  carried  out  in  order  to  commit  or  abort  a  transaction  can  handle  a  great  deal  of  lock- 
related  bookkeeping,  even  though  the  programmer  must  explicitly  obtain  locks  on  all  of  the  atomic 
data  items.  This  frees  the  programmer  to  consider  the  correct  behavior  of  the  transactions  being 
written, without giving unnecessary consideration to interactions with other transactions in the
system. However, this is not to say that the programmer does not have to be concerned at all with
locks and locking protocols; rather, it is intended to point out one aspect of lock management that the
programmer does not have to handle. Using the current ArchOS locking protocols, there are still a
number of decisions concerning locks that the programmer must make--for instance, whether to use
a discrete locking protocol or a tree locking protocol, how to organize the locks in a lock tree, what
data items are associated with a given lock, and so on. Section 4.4.2.7 deals with some of these
issues in more detail.)


The  above  discussion  explains  why  a  transaction  facility  was  included  for  use  within  ArchOS.  Since 
the  ArchOS  clients  are  also  interested  in  producing  highly  decentralized,  real  time  programs,  it  was 
felt  that  it  was  appropriate  to  extend  these  primitives  to  the  clients  as  well. 

Of  course,  there  are  arguments  against  using  a  transaction  facility  within  an  operating  system.  One 
major  objection  is  that  system  performance  could  be  greatly  reduced  (as  compared  to  a  system  that 
does  not  use  a  low-level  transaction  facility).  This  is  due  to  the  fact  that  occasionally  the  system  will 
have  to  suspend  the  processing  of  a  specific  transaction  while  data  is  being  copied  from  main 
memory  to  a  (virtually)  permanent  medium;  there  will  also  be  overhead  associated  with  the  initiation 
and  conclusion  of  each  transaction.  (The  impact  of  mutual  exclusion  on  system  performance  is  not 
mentioned  in  the  preceding  discussion  since  some  form  of  mutual  exclusion  must  be  present  in  any 
system  containing  shared  data  objects,  whether  it  includes  transactions  or  not.) 

It  seems  inevitable  that  a  performance  penalty  will  be  incurred  by  the  use  of  transactions,  but  it  is 
hoped  that  the  gains  in  the  area  of  data  permanence,  system  consistency,  high  availability,  and 
reliability  will  be  worth  the  price.  However,  three  other  decisions  were  made  to  address  the  problem 
of performance losses due to the use of transactions in ArchOS:

•  not  all  processing  must  take  place  within  transactions. 

• only specifically designated data items would be defined as atomic--that is, only those
items would be permanent, failure atomic, and so forth. (This decision forces the
arobject writer to explicitly indicate which data items must be atomic and has a great
influence on the types of steps that appear within a transaction. For instance, it would be
a questionable, if not wrong, programming practice to use non-atomic variables to pass
values from one nested transaction to another. An example illustrating this point is
shown in Section 4.4.3.5.)

• a new type of transaction, the compound transaction, is included to increase the potential
degree  of  system  concurrency. 

4.4.2.5 Rationale for the Inclusion of Compound Transactions

Compound transactions addressed two great concerns regarding the use of transactions, both
within ArchOS and by ArchOS clients. One of these concerns, performance, has already been
mentioned. Since ArchOS allows transactions to be nested arbitrarily (interleaving elementary and
compound transactions as desired), the use of some compound transactions can increase system
concurrency. This is due to the fact that compound transactions release all of the locks that they, or
any of their child transactions, have set at the completion of the execution of the compound
transaction. Thus, the resources that are controlled by those locks are often free to be used by other
transactions prior to the completion of all of the processing associated with a given transaction.


The  second  concern  addressed  by  compound  transactions  in  ArchOS  is  that  of  system  integrity  and 
liveliness. In traditional database systems which support nested transactions, all of the locks
obtained  by  child  (nested)  transactions  are  passed  to  their  parent  transactions  and  kept  until  the 
completion  of  the  highest  level  transaction  (at  which  time  the  transaction  is  either  aborted  or 
committed).  This  approach  was  not  suitable  for  ArchOS,  where  most  of  the  operating  system 
primitives  are  actually  expected  to  be  implemented  using  transactions.  If  a  client  transaction 
contained  operating  system  primitive  calls  which  were  implemented  using  traditional  nested 
transactions,  then  on  the  completion  of  the  primitive  call,  the  client  transaction  would  receive  any 
locks  that  the  system  primitive  transaction(s)  had  obtained  for  system  resources.  The  client  could 
subsequently  attempt  to  manipulate  the  system  resource  or  could  simply  hold  the  lock  on  the  system 
resource for an arbitrarily long time. Both of these possibilities were disturbing, but both were also
preventable  by  proper  use  of  compound  transactions.  If  each  ArchOS  primitive  is  not  just  a 
transaction,  but  rather  a  compound  transaction,  then  no  locks  on  system  resources  will  ever  be 
returned  to  the  client  transaction.  In  this  way,  compound  transactions  are  used  in  ArchOS  to  build  a 
"firewall"  between  the  operating  system  and  the  client. 

Of  course,  compound  transactions  have  some  disadvantages  associated  with  them  as  well.  For 
instance,  it  is  not  possible  to  simply  change  an  arbitrary  elementary  transaction  to  a  compound 
transaction  without  consideration  of  recovery  issues.  (Elementary  transactions  in  ArchOS 
correspond  to  traditional  nested  transactions  in  database  systems.)  During  the  course  of  its 
execution,  each  compound  transaction  may  invoke  a  number  of  arobject  operations  and/or  operating 
system  primitives.  At  the  completion  of  the  execution  of  the  compound  transaction,  all  of  the  locks 
that  have  been  obtained  during  transaction  processing  are  released  (despite  the  fact  that  the 
compound  transaction  may  be  nested  within  another  transaction).  In  the  event  that  a  higher-level 
transaction aborts after the compound transaction has committed, it is not possible to guarantee that
all  of  the  operations  performed  by  the  compound  transaction  can  be  properly  "undone"  (in  the  sense 
of traditional nested transactions). For instance, it is possible that another transaction has read, and
acted  upon,  data  that  represented  the  outcome  of  the  compound  transaction  after  it  had  committed 
(and  thereby  released  all  of  its  locks),  but  prior  to  the  execution  of  its  compensation  action.  Such  a 
situation could never arise in a traditional transaction system, but it certainly could happen in the
ArchOS  transaction  system. 
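
The different treatment of locks at the completion of elementary and compound transactions can be summarized in a small Python model. It is a sketch only: the class `Transaction`, its `kind` argument, and the resource name are hypothetical, and commit is reduced to deciding where the locks go. An elementary transaction passes its locks to its parent, while a compound transaction releases them regardless of nesting, so a client transaction never inherits a lock on a system resource from a primitive implemented as a compound transaction.

```python
class Transaction:
    """Hypothetical model of lock disposition when a transaction completes."""

    def __init__(self, kind, parent=None):
        if kind not in ("elementary", "compound"):
            raise ValueError("kind must be 'elementary' or 'compound'")
        self.kind = kind
        self.parent = parent
        self.locks = set()

    def lock(self, resource):
        self.locks.add(resource)

    def commit(self):
        """Return the set of locks made available to the rest of the system."""
        released = set()
        if self.kind == "compound" or self.parent is None:
            released = self.locks             # released at completion
        else:
            self.parent.locks |= self.locks   # kept, and passed to the parent
        self.locks = set()
        return released


client = Transaction("elementary")

# An operating system primitive implemented as a compound transaction.
primitive = Transaction("compound", parent=client)
primitive.lock("system-directory")
print(primitive.commit())   # {'system-directory'}: free for other transactions
print(client.locks)         # set(): the client received no system lock

# The same primitive implemented as a traditional nested transaction.
nested = Transaction("elementary", parent=client)
nested.lock("system-directory")
print(nested.commit())      # set(): nothing released yet
print(client.locks)         # {'system-directory'}: now held by the client
```

The model does not cover the cost discussed above: once the compound transaction's locks are released, other transactions may read its results before any compensation action runs.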

The ArchOS compensation action for a committed compound transaction consists, in part, of the
execution of a set of compensation operations associated with the arobject operation invocations and
operating system primitive invocations made during the course of execution of the compound
transaction. While these compensation operations may attempt to approximate the effects of the
traditional transaction "undo" operations, they cannot guarantee that the compensation will result in
the same system state as would have resulted if only nested elementary transactions had been used.
Rather,  the  system  state  is  transformed  to  a  state  that  is  equivalent  to  the  state  that  would  have 
resulted  if  all  of  the  other  concurrent  transactions  in  the  system  had  been  processed  in  the  absence 
of the aborted compound transaction. (See Section 4.2.4.1 for additional discussion of this point.) If
such  compensation  operations  can  be  constructed  and  the  weaker  guarantees  concerning  the 
system  state  in  the  case  of  the  abortion  of  a  higher-level  transaction  are  acceptable  to  a  transaction 
author,  then  compound  transactions  can  be  used  for  a  given  application;  however,  if  these  conditions 
are  not  sufficient,  then  elementary  transactions  must  be  used. 

A  few  examples  can  be  used  to  demonstrate  cases  in  which  compound  transactions  are  or  are  not 
appropriate  based  on  the  ability  of  compensation  actions  to  provide  the  required  semantics. 

First, consider a case in which compound transactions are appropriate: the dequeue operation of a
weak  queue.  In  such  a  queue,  the  first-in-first-out  ordering  of  elements  in  a  strong  queue  is 
weakened:  it  is  acceptable  to  alter  the  order  in  which  queue  elements  are  removed  from  the  queue  by 
the  dequeue  operation.  As  a  result,  it  is  possible  to  dequeue  elements  by  means  of  a  dequeue 
operation based on a compound transaction. This operation simply returns the head element of the
queue  and  then  releases  any  locks  obtained  in  the  process  of  dequeuing  that  element.  The 
compensation  operation  associated  with  this  dequeue  operation  is  also  quite  simple:  the  element  that 
was  previously  dequeued  is  returned  to  the  head  of  the  queue.  Because  of  the  semantics  of  the  weak 
queue,  these  operations  are  acceptable.  It  is  unimportant  whether  or  not  any  other  dequeue 
operations occurred during the interval between an element being dequeued and subsequently being
requeued  because  a  strict  ordering  of  the  queue  elements  is  not  required. 
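
The dequeue case can be sketched as follows, in Python and with hypothetical names (`WeakQueue`, `dequeue`, `compensate_dequeue`). Locking is omitted; the sketch shows only the operation and its compensation operation, with a second dequeue occurring in the interval between them.

```python
from collections import deque


class WeakQueue:
    """Hypothetical weak queue: removal order need not be strictly first-in-first-out."""

    def __init__(self, items=()):
        self.items = deque(items)

    def dequeue(self):
        """Compound-transaction style dequeue: return the head element."""
        return self.items.popleft()

    def compensate_dequeue(self, element):
        """Compensation operation: return the element to the head of the queue."""
        self.items.appendleft(element)


q = WeakQueue(["a", "b", "c"])

first = q.dequeue()          # transaction T1 takes "a"
second = q.dequeue()         # transaction T2 takes "b" in the meantime

# A transaction enclosing T1 aborts, so the compensation operation runs.
q.compensate_dequeue(first)
print(list(q.items))         # ['a', 'c']
```

Element "a" will now leave the queue after "b", which a strong queue would not permit but which the weak queue semantics accept.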

Second, consider a case in which a compound transaction is inappropriate: the enqueue operation
of a weak queue. In this case, there is a visibility problem--that is, if compound transactions were
used to implement the enqueue operation on a weak queue, it would be possible for other
transactions to view the queue in states that represent partial results of computations that are
subsequently aborted. As a result of viewing such states, it is possible that those transactions will
alter the state in an inappropriate way. This situation is illustrated by the use of a compound
transaction to implement the enqueue operation for a weak queue. Such an enqueue operation
would take an element passed to it and append it to the tail of the queue, releasing any locks obtained
at the completion of the operation. The most obvious compensation operation for this enqueue
operation would locate the desired element in the queue and remove it, thereby attempting to make it
appear as though it had never been there. However, this is not a sufficient compensation action since
it is possible that an element may be enqueued and later dequeued before the compensation action is
able  to  be  executed.  In  such  a  situation,  the  only  way  to  provide  the  required  semantics  for  the  weak 
queue  is  to  abort  the  transaction  that  dequeued  the  element.  Yet  this  presents  the  possibility  of 
cascading  aborts,  and  ArchOS  cannot  permit  cascading  aborts  to  occur.  Due  to  this  visibility 
problem,  compound  transactions  are  not  appropriate  for  the  implementation  of  the  enqueue 
operation. 
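
The visibility problem can be reproduced in a few lines of Python. As before, the names (`WeakQueue`, `enqueue`, `dequeue`, `compensate_enqueue`) are hypothetical and locking is omitted. The compensation operation shown is the "most obvious" one described above, and the sketch shows it arriving too late.

```python
from collections import deque


class WeakQueue:
    """Hypothetical weak queue with a naive compensation operation for enqueue."""

    def __init__(self):
        self.items = deque()

    def enqueue(self, element):
        """Compound-transaction style enqueue: append, then release all locks."""
        self.items.append(element)

    def dequeue(self):
        return self.items.popleft()

    def compensate_enqueue(self, element):
        """Naive compensation: remove the element if it is still in the queue."""
        try:
            self.items.remove(element)
            return True
        except ValueError:
            return False             # the element has already been consumed


q = WeakQueue()
q.enqueue("job")                     # T1 enqueues inside a larger transaction
taken = q.dequeue()                  # T2 sees and removes the element

# The transaction enclosing T1 aborts; compensation cannot find the element.
print(q.compensate_enqueue("job"))   # False
print(taken)                         # job
```

T2 has acted on an element that, after the abort, should never have existed. Restoring the required semantics would mean aborting T2 as well, which is the cascading abort that ArchOS does not permit.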

Another  disadvantage  associated  with  the  use  of  the  compound  transaction  is  also  related  to  the 
compensation mechanism: programmers must explicitly write the compensation operations
corresponding  to  the  arobject  operations  for  a  given  arobject.  The  concept  of  this  type  of  transaction 
is  quite  new,  and  we  are  not  yet  certain  about  the  nature  of  the  actions  to  be  performed  by  a  typical 
compensation  routine.  (In  fact,  there  are  many  issues  that  we  do  not  fully  understand  with  respect  to 
compound transactions: the number and variety of applications for compound transactions, the
amount  of  work  required  to  define  appropriate  compensation  actions,  the  form  such  actions  should 
take,  the  impact  of  compensation  actions  on  the  ability  of  ArchOS  to  make  guarantees  about  real  time 
behavior,  the  level  of  concurrency  that  can  be  achieved  using  compound  transactions,  and  so  on.) 
So  at  this  point,  we  have  decided  that  ArchOS  will  initially  support  only  programmer-coded 
compensation  actions.  (This  should  be  contrasted  with  the  case  of  traditional  nested  transaction 
database systems or, equivalently, nested ArchOS elementary transactions. In these cases, all of the
"undo"  or  "redo"  types  of  operations  are  determined  and  performed  by  the  transaction  facility  for  any 
user-written  transaction.)  However,  ArchOS  will  provide  support  to  automatically  compose  these 
basic  compensation  actions  in  order  to  facilitate  the  construction  of  arbitrary  higher-level 
transactions (either compound or elementary transactions) that invoke arobject operations based on
lower-level  compound  transactions. 

4.4.2.6 Rationale for the Transaction Syntax

Two potential formats were considered for the syntactic definition of transactions: an in-line format
(in which the transaction would be delimited in the body of the surrounding text by some keywords)
and a procedural format (in which a transaction was defined as a separate entity--such as a function
or procedure--and was "called" by the transaction initiator).

There was no overwhelming reason for choosing one format over the other. This issue seemed to
be largely one of stylistic preference, not performance or functionality. Assuming that a typical
transaction is relatively short, the in-line format has the advantage of showing the actual transaction
steps to a reader of the code; on the other hand, the procedural format could be parameterized in the
hope of avoiding the duplication of definitions that might occur with the in-line format. (This seems to
parallel the arguments for using macros or subroutines in a given application.) Our preference was to
use  the  in-line  format  since  it  appeared  to  be  more  readable  and  compact. 


Once the determination to use an in-line format was made, we needed to pick a specific format. We
felt strongly that transactions should not span multiple arobjects--that is, a transaction should not be
initiated  by  one  arobject  and  later  completed  by  another  arobject.  Since  the  arobject  is  the  basic  unit 
of  program  construction,  transactions  that  span  arobjects  hardly  seem  to  promote  modular  program 
construction  techniques.  In  fact,  we  felt  that  a  transaction  should  begin  and  end  in  a  single  process. 
The  reasoning  for  this  decision  is  a  simple  extension  of  the  argument  previously  given  for  requiring 
the transaction to begin and end in a single arobject. The  and CT{...} syntactic structures
selected for use in ArchOS force a transaction to begin and end in a single process, while other
possible  structures  (such  as  arbitrarily  placed  BeginTransaction  and  EndTransaction  delimiters)  did 
not. 

4.4.2.7 Rationale for Lock Support Decisions

Several important decisions were made with respect to the support to be provided to the client in
terms of obtaining locks on shared data objects. This portion of the rationale will deal with three of
the most important decisions: (1) the decision to support both a discrete locking protocol and a tree
locking protocol [Silberschatz 80]; (2) the decision that the client must explicitly set locks; and (3) the
decision to allow the client to explicitly release locks within a transaction.


ArchOS  supports  both  discrete  locks  and  tree  locks  because  we  feel  that  each  type  of  lock 
addresses  a  different  set  of  client  needs,  and  neither  type  alone  addresses  all  of  these  needs.  For 
instance,  the  tree  lock  is  supported  because  it  is  able  to  make  an  important  guarantee  about  the 
nature  of  the  computations  that  use  exclusively  tree  locks  from  a  single  lock  tree:  if  a  computation,  C, 
obeys the tree lock accessing rules and if all of the other computations that obtain locks in that tree
release them in a finite length of time, then computation C will also complete in finite time without the
occurrence of deadlocks. We believe that the guarantee that a computation will take place without the
possibility of a deadlock is extremely important and justifies the support of tree locks in ArchOS.
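
The tree lock accessing rules are not restated in this section. The Python sketch below therefore assumes the usual tree protocol from the database literature: the first lock may be placed on any node, each later lock requires that the parent of the node is currently held, a lock may be released at any time, and a node may not be locked again once it has been released. The class `TreeLockSession` and the lock tree used are hypothetical, and the exact ArchOS rules may differ in detail.

```python
class TreeLockSession:
    """Checks one computation's lock requests against the assumed tree protocol."""

    def __init__(self, parent_of):
        self.parent_of = parent_of     # node -> parent node (None for the root)
        self.held = set()              # locks currently held
        self.ever_locked = set()       # every node this computation has locked

    def lock(self, node):
        if node in self.ever_locked:
            raise ValueError(f"{node}: may not be locked twice by one computation")
        if self.ever_locked and self.parent_of[node] not in self.held:
            raise ValueError(f"{node}: parent lock is not currently held")
        self.held.add(node)
        self.ever_locked.add(node)

    def unlock(self, node):
        self.held.remove(node)


# A small hypothetical lock tree: root -> dir -> {file_a, file_b}
parents = {"root": None, "dir": "root", "file_a": "dir", "file_b": "dir"}

session = TreeLockSession(parents)
session.lock("dir")          # the first lock may be placed anywhere in the tree
session.lock("file_a")       # allowed: the parent "dir" is held
session.unlock("dir")        # allowed: locks may be released at any time
try:
    session.lock("file_b")   # refused: "dir" is no longer held
except ValueError as err:
    print(err)
```

Because every computation moves through the tree from parents to children, no cycle of computations each waiting for a lock held by the next can form within a single tree, which is the basis of the deadlock freedom claimed above.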


However, tree locks cannot be the only locking mechanism in ArchOS. In defining the tree structure
to be used in connection with tree locks, the client is explicitly specifying the legal access patterns for
locks in the tree. This may be a straightforward process when the client is dealing with a small
collection of related data items, but appears to be intractable when dealing with all of the locks in the
system. (If it were desired to guarantee that the entire system would be deadlock-free, then it would
be necessary to place all of the locks in the system in a single lock tree. Specifying a rational tree
structure for such a large number of often loosely related items seems impossible.) This leads us to
support discrete locks as well as tree locks. (Actually, ArchOS will provide support for multiple lock
trees defined by the client, as necessary.)

In  fact,  we  could  build  ArchOS  without  discrete  locks,  using  a  forest  of  tree  locks.  Yet,  once  it  was 
decided  that  we  could  not  include  all  of  the  locks  in  a  single  monolithic  tree,  it  seemed  to  be 
worthwhile  to  allow  the  more  traditional  discrete  lock  to  be  used  as  well.  Presenting  the  view  to  the 
client  that  each  discrete  lock  is  actually  a  degenerate  (single  node)  tree  lock  seemed  to  be 
unnecessarily  complicated.  (Although  that  may  be  the  manner  in  which  discrete  locks  are  actually 
implemented.) 

The  next  major  point  to  be  discussed  is  the  justification  for  the  rule  that  a  client  must  explicitly 
obtain  all  of  the  locks  needed  to  perform  a  given  computation.  This  decision  was  reached  for  two 
reasons.  First,  a  system  that  would  handle  the  acquisition  of  locks  automatically  would  be  at  or 
beyond  the  state-of-the-art  for  database  systems.  Since  this  is  not  related  to  ArchOS’  prime  research 
goals,  we  would  prefer  not  to  expend  the  effort  that  such  a  capability  would  require.  Second,  even  if 
we  had  an  automatic  lock  acquisition  facility,  it  seems  inevitable  that  the  system  would  be  more  prone 
to  deadlocks  than  if  the  client  explicitly  requested  the  locks.  This  is  due  to  the  fact  that  the  automatic 
system  would  often  have  to  upgrade  a  "read"  lock  to  a  "write"  lock  during  the  course  of  a 
transaction.  The  attempt  to  upgrade  the  lock  could  lead  to  a  deadlock  situation,  which  might  have 
been avoided if a "write" lock had been requested in the first place. Although the automatic system
would have no way of knowing that a "write" lock would eventually be needed, the client who wrote
the program would indeed have known and could have avoided the situation in many cases.

The final major point to be addressed concerning locks is the use of the ReleaseLock primitive--
specifically, a justification for the use of this primitive within a transaction.

Although the use of the ReleaseLock primitive within a transaction can violate the locking protocols
of the transaction system, it can also be used in a manner consistent with those rules. In particular,
tree locks can be released during the course of a transaction if they are not needed for the
transaction computation. For example, if a given tree lock were obtained only to lock some of its
descendants in the tree, then that lock could be released after the locks on the descendants have
been obtained, with no undesirable effects. Also, it is possible that explicit releasing of locks might be
useful in taking full advantage of Sha's notion of setwise serializability [Sha
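
The early release of a tree lock described above can be sketched in C. This is only an illustration, assuming lock-tree rules in the style of the classic tree protocol: after its first lock, a transaction may lock a node only while it holds that node's parent, and it may not re-lock a node it has released. The exact ArchOS rules may differ. The type TreeTxn and the functions tree_txn_init, tree_lock, and tree_release are hypothetical names, not ArchOS primitives.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 16

/* Hypothetical per-transaction view of one lock tree. */
typedef struct {
    int  count;                /* number of nodes in the tree */
    int  parent[MAX_NODES];    /* parent[i] is the parent of node i; -1 for the root */
    bool held[MAX_NODES];      /* locks currently held by the transaction */
    bool released[MAX_NODES];  /* locks already released; they may not be re-acquired */
    int  acquired;             /* number of locks acquired so far */
} TreeTxn;

void tree_txn_init(TreeTxn *t, const int *parent, int n)
{
    assert(n >= 0 && n <= MAX_NODES);
    t->count = n;
    t->acquired = 0;
    for (int i = 0; i < MAX_NODES; i++) {
        t->parent[i] = (i < n) ? parent[i] : -1;
        t->held[i] = false;
        t->released[i] = false;
    }
}

/* Acquire the lock on a node. The first lock may be on any node; every
 * later lock requires that the parent's lock is currently held. */
bool tree_lock(TreeTxn *t, int node)
{
    if (node < 0 || node >= t->count || t->held[node] || t->released[node])
        return false;
    int p = t->parent[node];
    if (t->acquired > 0 && (p < 0 || !t->held[p]))
        return false;
    t->held[node] = true;
    t->acquired++;
    return true;
}

/* Release a held lock; the node cannot be locked again by this transaction. */
bool tree_release(TreeTxn *t, int node)
{
    if (node < 0 || node >= t->count || !t->held[node])
        return false;
    t->held[node] = false;
    t->released[node] = true;
    return true;
}
```

Under these assumed rules, a transaction that locked a parent only to reach one child can release the parent as soon as the child is locked and can still continue down that child's subtree. It can no longer lock the parent's other children.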

A few other minor items concerning locking issues should be mentioned. The syntax for
declaring locks in an arobject is similar to the syntax used in declaring variables (with types
DISCRETE-LOCK-ID or TREE-LOCK-ID). However, the tree structure of the tree locks is defined
dynamically by means of the CreateLock and DeleteLock primitives.

Also,  we  have  not  yet  made  a  final  decision  concerning  the  "strength"  of  the  connection  between  a 
lock and the data item it is associated with. If locks are closely associated with the data item, then the
system  can  perform  a  number  of  checks  to  guarantee  that  the  locks  are  being  used  properly. 
However,  if  this  bond  is  weaker,  then  the  client  has  more  freedom  to  associate  locks  with  more 
general  or  more  abstract  data  items  or  even  facilities.  (For  example,  the  client  could  obtain  a  lock  on 
an  arbitrary  character  string  which  may  not  correspond  to  any  data  item  at  the  time  the  lock  is 
obtained.  This  capability  might  be  more  difficult  to  provide  if  locks  are  tightly  associated  only  with 
existent  data  items.)  In  this  case,  though,  the  client  must  use  a  programming  convention  in  order  to 
guarantee  that  the  locks  are  used  properly  in  accessing  the  data.  These  two  examples  are  the  end 
points  of  a  range  of  possible  lock  strengths.  We  have  yet  to  decide  where  in  this  range  of  possibilities 
the  ArchOS  supported  locks  should  lie. 

Finally, ArchOS does provide a client-specified timeout parameter to bound the execution time of a
given transaction. This provides a crude facility to prevent deadlocks from tying up the transaction
for an arbitrarily long time. ArchOS will not necessarily detect a deadlock condition for the client; the
ultimate responsibility for handling the possibility of a deadlock belongs to the client. However,
ArchOS will provide some deadlock detection facilities. The exact nature of these facilities has not
yet been fully determined, but some candidate facilities are: ArchOS will detect all deadlock cycles of
length two (or perhaps three), or ArchOS will detect all deadlocks that involve the resources of only a
single node. The level of deadlock detection service provided will be strongly influenced by the cost
of providing that service. Since we will not be able to detect all deadlock conditions, we will only
perform that deadlock analysis that is relatively inexpensive, while still capable of detecting some of
the more common situations. In the event that a deadlock is detected, ArchOS will abort transactions
as required in order to allow processing to continue without deadlock.
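
As an illustration of why the first candidate facility is inexpensive, the following C sketch checks a wait-for graph for deadlock cycles of length two. It is a sketch under stated assumptions, not the ArchOS design: the WaitGraph type, the fixed MAX_TXN bound, and the function find_two_cycle are hypothetical, and the report does not say how ArchOS would represent which transaction waits for which.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TXN 8

/* Hypothetical wait-for graph: waits_for[a][b] is true when transaction a
 * is blocked on a lock currently held by transaction b. */
typedef struct {
    bool waits_for[MAX_TXN][MAX_TXN];
} WaitGraph;

/* Look for a deadlock cycle of length two (a waits for b and b waits for a).
 * On success, store the pair in *first and *second and return true.
 * The check examines each unordered pair once, so it costs O(MAX_TXN^2). */
bool find_two_cycle(const WaitGraph *g, int *first, int *second)
{
    assert(g != NULL && first != NULL && second != NULL);
    for (int a = 0; a < MAX_TXN; a++) {
        for (int b = a + 1; b < MAX_TXN; b++) {
            if (g->waits_for[a][b] && g->waits_for[b][a]) {
                *first = a;
                *second = b;
                return true;
            }
        }
    }
    return false;
}
```

A longer cycle (for example, 0 waits for 1, 1 waits for 2, 2 waits for 0) is not reported by this check. Such a deadlock would be left to the client-specified timeout.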

4.4.2.8 Rationale for Inclusion of Critical Regions in ArchOS

ArchOS supports critical regions to assure exclusive access to shared data items within a single
arobject by means of the Region primitive. However, it may be noted that ArchOS also supports
another, more secure, method of obtaining mutually exclusive access to shared data items: the
transaction.


One may therefore ask why critical regions are also supported. In fact, the main reason is that critical regions do not require all of
the  powerful  facilities  that  a  transaction  supplies,  whether  they  are  needed  or  not.  It  was  decided  that 
no  matter  how  efficient  the  ArchOS  transaction  facility  was,  it  would  still  probably  be  slower  than  a 
mechanism  that  only  provides  a  small  fraction  of  the  power  of  a  transaction  (even  a  compound 
transaction,  which  would  typically  involve  less  processing  than  an  elementary  transaction).  As  a 
result,  the  critical  region  can  be  used  to  provide  a  simple  mutual  exclusion  mechanism  that  will  often 
be  needed  when  accessing  shared  data  objects  for  which  the  failure  atomic,  permanent,  or 
serializable  properties  of  transactions  are  not  required. 

4.4.2.9 Rationale for Transaction Nesting Rules
ArchOS allows both elementary and compound transactions to be nested arbitrarily within a
transaction. Such interleavings are necessary for proper system behavior. This will be demonstrated
by a number of examples that point out the necessity of each type of nesting:

•  In  order  to  provide  the  traditional  database  nested  transaction  view,  it  is  necessary  to 
allow  the  nesting  of  elementary  transactions  within  elementary  transactions. 

•  Where  failure  atomicity  (the  property  by  which  either  all  of  the  actions  of  a  transaction  are 
performed  or  none  are)  and  visibility  issues  (as  mentioned  in  Section  4.4.2.5)  are  not  of 
prime  concern,  system  performance  can  be  maximized  by  employing  nested  compound 
transactions  rather  than  nested  elementary  transactions. 

•  In  cases  where  compensate  routines  can  provide  similar  functionality  to  more  traditional 
"undo"  operations  on  locked  data  items  (for  example,  consider  the  case  of  compensating 
for  the  allocation  of  a  new  page  of  memory  from  the  operating  system),  it  would  be 
advantageous  for  performance  reasons  to  use  compound  transactions  nested  within 
elementary  transactions. 

•  Consider the case in which a compound transaction must perform a certain computation.
It is certainly reasonable to believe that it might often be carried out by a set of nested
elementary transactions, thereby giving all of the automatic failure recovery and visibility
features of nested elementary transactions to the desired computation. It is of no interest
to the lower nested levels that their locks will all be released as soon as the compound
transaction commits (or aborts). (Note that the compound transaction in this case acts a
great deal like the top-level transaction of a set of nested elementary transactions.)

Since  all  of  these  cases  seemed  to  be  useful,  it  was  established  that  ArchOS  transactions  could  be 
constructed by any arbitrary mixture of the two types of transactions.

Such mixtures of the two transaction types can be quite useful in selectively passing the locks of
some child transactions to a parent transaction while releasing those of other child transactions. This
ability to selectively pass locks is required since the ArchOS primitives will usually contain compound
transactions (in order to prevent passing system resource locks to clients), yet client elementary
transactions may be built by making Requests for services from other client arobjects. If a Request
were  simply  a  compound  transaction,  and  the  computation  that  resulted  from  the  Request  execution 
were  considered  to  be  a  child  of  the  Request  transaction,  then  none  of  the  requestee’s  locks  would  be 
returned  to  the  requestor  (client)  due  to  the  nature  of  a  compound  transaction. 


Figure 4-4: Selective Lock Passing by the Request Primitive

A mixture of elementary and compound transactions can be used in the above example to pass the
requestor the requestee's locks while releasing the system resource locks at the completion of the
Request primitive execution. (See Figure 4-4.) By implementing the Request primitive as an
elementary transaction with several child transactions, the desired effect can be achieved. The locks
obtained by the requestee's computation are automatically passed back to the requestor, and the
system actions (Req1 and Req2 in the figure) are encapsulated within child compound transactions,
so the requestor will not receive any of the system resource locks (since they are not returned to the
Request primitive's highest level elementary transaction).
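
The lock-passing behavior described in this section can be sketched in C. This is a minimal model, assuming that a committing elementary child hands its locks to its parent while a committing compound transaction (or a top-level transaction) releases them. The Txn type, the bitmask representation of lock sets, and the function txn_commit are hypothetical and are not part of the ArchOS interface.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ELEMENTARY, COMPOUND } TxnKind;

/* Hypothetical transaction record; each bit of locks stands for one lock. */
typedef struct Txn {
    TxnKind     kind;
    struct Txn *parent;   /* NULL for a top-level transaction */
    unsigned    locks;
} Txn;

/* Commit a transaction and return the set of locks released to the system.
 * An elementary child passes its locks up to its parent instead. */
unsigned txn_commit(Txn *t)
{
    assert(t != NULL);
    unsigned released = 0u;
    if (t->kind == ELEMENTARY && t->parent != NULL) {
        t->parent->locks |= t->locks;   /* inherited by the parent */
    } else {
        released = t->locks;            /* compound or top-level: release */
    }
    t->locks = 0u;
    return released;
}
```

In the arrangement of Figure 4-4, a compound child holding a system resource lock releases it at commit, while the requestee's elementary transaction passes its lock up through the Request transaction to the requestor.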

4.4.2.10  Rationale  for  Inclusion  of  the  AbortIncompleteTransaction  Primitive 
The  AbortIncompleteTransaction  primitive  provides  a  unique  capability  in  handling  responses  from 
child  transactions. 


To  understand  how  such  a  primitive  facility  could  be  used,  consider  the  following  situation:  if  a 
transaction  performs  a  RequestAll  invocation  for  some  operation  on  all  of  the  arobject  instances  of  a 
certain  type,  then  each  invocation  will  be  serviced  and  appropriate  replies  will  be  returned.  Also, 
suppose  that  while  servicing  the  request,  each  arobject  instance  performs  some  transaction 
processing.  And  finally,  suppose  that  the  requestor  is  only  interested  in  the  first  reply  to  the 
RequestAll  invocation.  (Although  this  scenario  may  sound  contrived,  it  is  not.  This  is  exactly  the 
manner  in  which  a  client  would  attempt  to  obtain  the  shortest  response  time  from  a  set  of  arobjects, 
all of which are capable of servicing the client's request. The client would simply request that all of
the arobjects provide the required service (by means of the RequestAll primitive) and then wait until
the first reply is obtained. Since, at that point, the actions of the rest of the servers are no longer of
interest to the requestor, their actions can be terminated.)


After  receiving  the  first  reply,  the  client  could  not  correctly  proceed  without  the 
AbortIncompleteTransaction  primitive.  This  is  due  to  the  fact  that  the  computations  carried  out  by  the 
arobject instances that respond (or would respond) after the first should be aborted (including the
transaction processing that they are carrying out). However, the client has no way of knowing the
identity of those arobject instances that received the request since the RequestAll primitive was used
and therefore the exact identities of the acceptors were never explicitly known by the requestor. If the
client aborts all of the processing associated with that RequestAll primitive, then the work done by the
first replier will be lost as well. The AbortIncompleteTransaction primitive handles this case since it
allows ArchOS to handle all of the bookkeeping associated with the identities of the acceptors, as
well as performing the necessary selective aborts.


Figure 4-5 shows the transaction tree structure associated with the RequestAll primitive. This figure
will illustrate the actions taken by the AbortIncompleteTransaction primitive with respect to the
transaction tree. The RequestAll transaction has several children transactions, one for each of the
acceptors of the request for service. At the completion of each server's Reply, the subtransaction
associated with that server is concluded (committed). The AbortIncompleteTransaction primitive
aborts all of the child transactions of the nested transaction (RequestAll, in this case) that have not
yet reached that commit point. In this way, the work that has been done will be maintained, while the
incomplete work will be aborted.

Figure 4-5: RequestAll Transaction Tree

This primitive has proven quite useful in example programs, including the partially replicated
distributed  directory  example  contained  in  Appendix  A  of  this  document. 
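
The selective abort performed by this primitive can be sketched in C as follows. This is only an illustration of the bookkeeping: the TxnState enumeration and the function abort_incomplete are hypothetical names, and the real primitive operates on the system's transaction tree rather than on a plain array.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TXN_ACTIVE, TXN_COMMITTED, TXN_ABORTED } TxnState;

/* Abort every child transaction that has not yet reached its commit point.
 * Committed children are left untouched, so completed work is kept.
 * Returns the number of children aborted. */
size_t abort_incomplete(TxnState *children, size_t n)
{
    assert(children != NULL || n == 0);
    size_t aborted = 0;
    for (size_t i = 0; i < n; i++) {
        if (children[i] == TXN_ACTIVE) {
            children[i] = TXN_ABORTED;
            aborted++;
        }
    }
    return aborted;
}
```

In the RequestAll scenario, the child belonging to the first replier is already committed when the call is made, so only the slower servers are aborted.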

4.4.2.11 Real-Time Facilities

ArchOS is designed to support user-defined real-time deadlines, but its support for real-time
systems differs fundamentally from other real-time operating systems. There are, of course, many
existing real-time systems being used in a number of environments including the military, process
control, and robotics. It is interesting, however, to observe some of the characteristics found in the
operating systems (or executives, as they are sometimes called when a full operating system is not
implemented) designed for those systems which provide support for real-time operation. Two primary
characteristics can be described:




•  These  operating  systems  are  kept  simple,  with  minimal  overhead,  but  also  with  minimal 
function.  Virtual  storage  is  almost  never  provided;  file  systems  are  usually  either 
extremely  limited  or  non-existent.  I/O  support  is  kept  to  an  absolute  minimum. 
Scheduling  is  almost  always  provided  by  some  combination  of  FIFO  (for  message 
handling),  priority  ordering,  or  round-robin,  with  the  choice  made  arbitrarily  by  the 
operating  system. 

•  Simple  support  for  management  of  a  hardware  real-time  clock  is  provided,  with  facilities 
for  periodic  process  scheduling  based  on  the  clock,  and  timed  delay  primitives. 

Conspicuously  missing  from  these  systems  at  the  operating  system  level  is  any  specific  support  for 
managing  user-defined  deadlines,  even  though  meeting  such  deadlines  is  the  primary  characteristic 
of  real-time  application  requirements.  Instead,  these  systems  are  designed  to  meet  their  deadlines  by 
ensuring  that  the  available  resources  significantly  exceed  the  actual  user  requirements,  and  the 
implementation  is  followed  by  an  extensive  testing  period  to  verify  that  this  assumption  is  maintained 
under  "normal"  loads.  In  priority-driven  systems,  deadlines  are  handled  by  assigning  a  high  fixed 
priority  to  processes  with  critical  deadlines,  disregarding  the  resulting  impact  to  less  critical 
deadlines. 

The  ArchOS  primitives  have  been  designed  to  provide  the  information  which  will  enable  the 
scheduling  function  to  sequence  processes  according  to  their  deadlines,  ensuring  that  all  deadlines 
are  met  in  each  node  as  long  as  there  are  sufficient  resources  to  meet  them.  Techniques  for  handling 
such  deadlines  are  well  known,  given  sufficient  resources. 
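
One such well-known technique is to run processes in order of increasing deadline. The C sketch below illustrates it under simplifying assumptions that are not taken from the report: all processes are ready at time zero, they run without preemption on a single node, and each has a known compute time. The Proc type and the function edf_feasible are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical process descriptor; times are in arbitrary clock ticks. */
typedef struct {
    int  id;
    long compute;    /* processing time required */
    long deadline;   /* absolute deadline, measured from time zero */
} Proc;

static int by_deadline(const void *a, const void *b)
{
    const Proc *x = a;
    const Proc *y = b;
    return (x->deadline > y->deadline) - (x->deadline < y->deadline);
}

/* Sort the processes into earliest-deadline-first order and report whether
 * every deadline is met when they run back to back in that order. */
bool edf_feasible(Proc *procs, size_t n)
{
    assert(procs != NULL || n == 0);
    qsort(procs, n, sizeof *procs, by_deadline);
    long now = 0;
    for (size_t i = 0; i < n; i++) {
        now += procs[i].compute;
        if (now > procs[i].deadline)
            return false;   /* insufficient resources for this deadline */
    }
    return true;
}
```

Under these assumptions, if any ordering meets all of the deadlines then the earliest-deadline-first ordering does as well. A false result therefore corresponds to the insufficient-resource case discussed next.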

An  important  ArchOS  research  area,  however,  is  the  handling  of  these  deadlines  when  there  are 
insufficient  resources  to  handle  all  of  them.  User  policies  defining  potential  objective  functions  for 
these  cases  will  be  accepted,  and  a  best  effort  will  be  made  to  implement  these  policies  when 
resources  are  insufficient.  Essentially  two  primary  decisions  are  required  in  these  cases: 

•  Which  deadlines  should  be  missed,  and  by  how  much?  That  is,  should  some  deadlines 
take  precedence,  or  should  missed  deadlines  be  fairly  distributed?  Should  some 
processes be aborted if their deadlines cannot be met?

•  At what point should some processes be migrated to other nodes? If so, how many
processes,  and  which  processes  should  be  migrated? 

4.4.2.12  Policy  Definitions 

We felt from the outset that an application running under ArchOS would need a great deal of
flexibility in order to meet its specifications. Therefore, we decided to support application-defined
policies to manage certain resources. At this point, we have limited the policy definition capability to
two specific management tasks: process scheduling (initially on a single node) and process
reconfiguration (process migration or relocation) among nodes. This decision was made because we
did  not  feel  that  there  was  a  global  programming  paradigm  involving  application-defined  policies  that 
we  wished  to  impose  on  the  application  program  writer;  currently,  we  treat  each  facility  on  its  own 
merits  with  respect  to  the  use  of  application-defined  policies.  We  do  have  an  overall  framework  within 
ArchOS  for  the  support  of  application-defined  policies,  but  we  would  like  to  demonstrate  the 
suitability  of  our  ideas  on  a  few  test  cases  before  spreading  the  approach  throughout  the  entire 
system. 

In  addition,  process  scheduling  and  process  reconfiguration  in  a  distributed  system  are  of  particular 
interest  to  the  real-time  application  programmer,  who  is  often  in  the  best  position  to  determine  the 
high-level  scheduling  policies  of  the  system.  This  is  due  to  the  fact  that  the  actual  deadline 
constraints  that  the  system  must  satisfy  are  imposed  by  the  external  environment  and  the  nature  of 
the  application  processing.  By  allowing  the  application  programmer  to  define  the  scheduling  policy,  it 
is  expected  that  the  system  will  better  be  able  to  carry  out  the  process  scheduling  task  with  these 
factors  in  mind,  particularly  in  situations  where  there  are  insufficient  processing  resources  to  carry 
out all of the application's functions.

Another area of interest within the ArchOS project concerning the use of policy within a facility is
the separation of the policy and mechanism portions of the facility. This work is similar in spirit (if not
in implementation) to the policy/mechanism separation work done for the Hydra system [Wulf 81].
However, for the most part, ArchOS will not focus much attention on that particular aspect of policy
definition for a facility. This is partially due to the fact that, in many cases, it is very difficult to draw
the line between those portions of a facility that are policies and those that are mechanisms. And
since drawing that line is not a central focus of the ArchOS development effort, we feel that we should
not spend a great deal of time pursuing this task. (We will indeed attempt to separate policy and
mechanism insofar as possible in ArchOS, but we will not push this idea very hard.)

Once we had decided to provide support for application-defined facility policies and to make an
attempt  to  separate  policy  and  mechanism  in  the  implementation  of  the  facility,  there  were  still  a 
number  of  open  questions. 

First, we recognized that there is a spectrum of choices in the amount of flexibility to be provided to
the application writer. One approach to the specification of a facility's services would involve: (1)
determining the set of all of the policies that might ever be used to manage the facility, (2) selecting a
set of mechanisms that can be combined in various ways by that set of policies to provide any of the
previously identified facilities, and (3) making those mechanisms available to the application
programmer to construct the facility. While this scheme allows the application writer a great deal of
freedom,  it  is  possible  that  it  might  be  very  difficult  to  actually  carry  out  the  first  two  steps  listed  above. 
A  second  approach  limits  the  options  of  the  application  writer  (and  so  might  be  more  secure  and 
reliable  from  the  operating  system's  point  of  view)  while  still  allowing  the  client  to  select  the  policy  to 
be  used  in  managing  a  facility.  In  this  case,  the  application  programmer  would  be  able  to  select  the 
facility  policy  from  a  limited  menu  of  policies.  (Essentially,  the  programmer  has  been  given  a  policy 
mode  switch.)  The  programmer  would  not  necessarily  even  know  the  mechanisms  that  actually 
underlie  the  policies  in  this  case.  For  process  scheduling  when  processing  resources  are  not 
sufficient  to  meet  processing  demands,  we  have  chosen  a  scheme  that  lies  somewhere  between  the 
extreme cases listed above. We intend to design a policy scheduling facility that has a fixed set of
mechanisms  that  the  client  cannot  directly  access.  Some  of  these  mechanisms  will  evaluate  value 
functions  in  order  to  compute  an  optimal  (or  near  optimal)  schedule  for  processes  on  a  given  node. 
The  value  functions  that  correspond  to  many  of  the  well-known  scheduling  approaches  will  be 
available  to  the  client  (in  a  form  similar  to  a  library).  However,  the  client  may  also  specify  the 
parameters  for  a  certain  class  of  value  functions  and  submit  a  value  function  to  the  process 
scheduling  facility  to  determine  the  exact  policy  to  be  followed.  Although  this  approach  does  not 
allow  the  client  to  specify  an  arbitrary  value  function  to  the  process  scheduler,  it  does  allow  much 
more  freedom  than  choosing  from  a  small  menu  of  well-known  policies  (such  as  "missing  fewest 
deadlines"  or  "minimize  maximum  lateness"). 
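
To make the idea of a parameterized value function concrete, the following C sketch shows one possible class. The shape used here (full value up to the deadline, then a linear decay to zero) is purely an assumption for illustration; the report does not specify the class of value functions ArchOS will accept. The ValueProc type and the functions value_at and schedule_value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters a client might supply for one process. */
typedef struct {
    double value;      /* value earned if the process completes by its deadline */
    double deadline;   /* deadline, measured from time zero */
    double decay;      /* value lost per unit of lateness */
    double compute;    /* processing time required */
} ValueProc;

/* Value earned by completing process p at time t. */
double value_at(const ValueProc *p, double t)
{
    assert(p != NULL);
    if (t <= p->deadline)
        return p->value;
    double v = p->value - p->decay * (t - p->deadline);
    return v > 0.0 ? v : 0.0;
}

/* Total value of running the processes back to back in the given order.
 * order[i] is the index in procs of the i-th process to run. */
double schedule_value(const ValueProc *procs, const int *order, int n)
{
    assert(procs != NULL && order != NULL);
    double now = 0.0;
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        const ValueProc *p = &procs[order[i]];
        now += p->compute;
        total += value_at(p, now);
    }
    return total;
}
```

A scheduling mechanism could compare candidate orders by their total value and choose the best one. When not every deadline can be met, the decay parameter determines which process is the cheaper one to run late.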

Second,  there  are  a  number  of  places  where  application-defined  policies  and  policy/mechanism 
separation can be used. We considered the possibility that we could have two levels of application-
defined policies: policies that affected the entire distributed program and policies that affected
individual arobjects. We believe that there are cases where both of these alternatives make sense.
However, neither process scheduling nor process reconfiguration seems to have a compelling use for
arobject-level policies. As a result, the policy definition capabilities described in this document are
concerned only with policies that affect the entire distributed application program. Perhaps we will
pursue  multiple  levels  of  policy  in  a  later  version  of  ArchOS. 

There is another sense in which multiple levels of policy may be discussed -- a hierarchical sense. A
facility at a given level of abstraction may be decomposed into a set of mechanisms that are manipulated
by means of a set of policy statements. Each of the mechanisms at that level of abstraction, in turn,
is also decomposable at a lower level of abstraction into a set of lower-level mechanisms
manipulated by a lower-level policy, and so on. At this time, it is difficult to see exactly how such a
hierarchical decomposition of facilities will aid in their design or performance. For the moment, we
have decided to apply the notion of policy/mechanism separation at only a single level.
(Our system was not intended to examine this policy/mechanism hierarchy, and, once again, it is not
central to the development of ArchOS as a research vehicle.)

Third,  although  we  are  currently  employing  only  two  client-defined  policies,  the  primitives  provided 
for  the  definition  of  policies  by  the  application  program  are  intended  to  be  used  for  any  other  policies 
that  may  be  defined.  They  are  very  generic  in  nature  (associating  a  policy  module  with  a  special 
policy  name  and  associating  values  with  attribute  names  that  may  later  be  examined  in  carrying  out 
policy decisions) and were intentionally selected to support a wide range of policy definitions. We felt
that  a  general  approach  to  policy  definition  was  more  desirable  than  a  specialized  approach  for  each 
policy  to  be  handled. 

Finally,  we  have  given  the  implementation  of  an  application-defined  process  scheduling  policy  some 
thought.  Since  this  facility  must  be  accessed  often  and  it  must  manipulate  low  level  operating  system 
objects  (such  as  the  queue  of  runnable  processes),  it  would  be  advantageous  for  it  to  be  resident  in 
the  kernel’s  memory  space.  We  intend  to  have  a  method  to  allow  at  least  that  special  case  of  an 
application-defined policy module to be resident in kernel space. Of course, updating the process
scheduling  policy  and  coordinating  the  access  of  the  policy  module  with  client-space  data  will 
provide  some  additional  complications,  but  these  should  not  be  too  great. 

4.4.3  ArchOS  Primitives 

The  ArchOS  primitives  described  in  this  report  can  be  divided  into  two  levels.  The  primitives  which 
belong  to  the  ArchOS  kernel  level  are  called  kernel  primitives  and  the  primitives  which  are 
implemented  above  the  kernel  are  called  system  primitives.  This  distinction  will  help  an  application 
designer, but we have not provided any indication of which is which at this stage. The relation between
the  system  and  kernel  primitives  will  be  described  in  more  detail  in  the  System  Architecture 
Specification  document. 

This  section  starts  by  briefly  pointing  out  the  important  design  problems  for  the  complete  set  of 
ArchOS  primitives.  It  then  describes  our  design  decisions  for  each  primitive,  reflecting  the  structure 
in the previous chapter.

4.4.3.1 Important Design and Research Problems

In the design of the ArchOS primitives, we consider simplicity and uniformity to be the most
important design goals. All of the ArchOS primitives should provide a uniform interface to a client.
Thus, a primitive is accessed by a procedure (or function) invocation.




While  we  are  trying  to  achieve  simplicity,  we  also  consider  the  primitive  set’s  optimality  as  well  as  its 
completeness.  Unfortunately,  there  is  no  formal  notion  of  a  complete  set  of  primitives  in  a  distributed 
operating  system  context  [Tokuda  83a],  so  we  have  tried  to  evaluate  the  primitives’  expressive  power 
in  various  distributed  applications. 

•  Simplicity:

We  have  adopted  a  procedure  (or  function)  call  interface  for  all  of  our  primitives.  Even 
though  the  request  message  must  be  transferred  to  a  suitable  arobject  by  a  Request 
primitive,  it  will  be  called  by  a  procedure  (or  function)  interface.  It  should  be  noted  that 
our  original  intention  was  to  reduce  the  number  of  primitives  visible  from  a  client  process. 

For  instance,  a  single  Create  primitive  would  be  sufficient  if  the  target  language  allowed 
"overloading"  of  procedures/functions.  The  current  CreateArobject  and  CreateProcess 
could be united and used as follows:

arobject-id  =  Create(arobj-name); 
process-id = Create(process-name);

One  weak  point  is  the  lack  of  a  notion  of  default  parameters.  For  example,  the  default 
value of the node-id argument in the CreateProcess primitive could be set to "ANY-
NODE", so that a caller would not need to specify this optional argument every time.
However, the current ArchOS interface requires that the client specify all parameter
values explicitly at calling time (due to a limitation of the C language).

•  Uniformity:

We have provided a uniform syntax as well as uniform (i.e., network transparent or
location  independent)  semantics  for  each  primitive.  For  instance,  communication 
primitives  provide  the  identical  semantics  for  local  and  remote  communications.  Arobject 
and  process  management  primitives  also  provide  uniform  semantics  for  creating  and 
destroying  an  instance. 

•  Optimality and Completeness:

While  attempting  to  minimize  the  number  of  primitives  in  ArchOS,  we  paid  careful 
attention  to  assure  that  the  full  expressive  power  of  the  primitive  set  was  maintained.  On 
the  other  hand,  many  primitives  were  added  to  support  new  facilities  such  as  transaction 
and policy management. Also, due to some language constraints, we needed to increase
the number of primitives a little, but the current primitive set can be used to construct an
extremely diverse set of distributed applications.
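
The default-parameter limitation noted under Simplicity can be worked around in C with a thin wrapper, as the sketch below suggests. It does not use the real ArchOS primitives, whose signatures are not given in this section. The CreateArgs type, the constant ANY_NODE, and both functions are hypothetical stand-ins that only show the pattern.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_NODE (-1)   /* hypothetical stand-in for the "ANY-NODE" value */

/* Hypothetical record of the arguments a creation primitive would receive. */
typedef struct {
    const char *process_name;
    int         node_id;
} CreateArgs;

/* Stand-in for a primitive that requires every argument explicitly. */
CreateArgs create_process_args(const char *process_name, int node_id)
{
    assert(process_name != NULL);
    CreateArgs args = { process_name, node_id };
    return args;
}

/* Wrapper that supplies the default node, emulating an optional argument. */
CreateArgs create_process_args_default(const char *process_name)
{
    return create_process_args(process_name, ANY_NODE);
}
```

The cost of this pattern is one additional entry point for each defaulted argument, which works against the goal of keeping the primitive set small.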

4.4.3.2 Arobject/Process Management

We view an arobject as a basic module for embodying a distributed abstract data type. In particular,
we decided to treat an arobject as an active system entity rather than a passive entity. There are many
advantages and disadvantages related to this decision. First, we can easily define an autonomous
module. It is easy for an arobject not only to initialize its computational state by itself, but also to
recover its computational state by itself. In other words, by leaving a single INITIAL process in each
arobject, a designer can give responsibility for recovery to the INITIAL process. Second, unlike
traditional procedure invocation in abstract data types, a caller cannot invoke an operation of an
arobject  in  a  master-slave  manner,  but  must  use  a  form  of  rendezvous.  The  receiver  therefore  has  the 
right  to  accept,  reject,  or  delay  the  requested  function.  Finally,  the  degree  of  parallelism  within  an 
arobject  can  be  dynamically  changed  in  many  ways.  For  instance,  a  process  can  be  created  in  a 
different  node  to  perform  a  requested  computation  on  demand  or  many  processes  can  be  pre-created 
to  accept  a  particular  type  of  request. 


We provided completely network-transparent arobject/process management. That is, creation
and destruction of arobjects and processes can be performed without knowing the location of the
target arobject or process.


The  following  functionalities  are  not  adopted  for  the  current  ArchOS: 

•  Multiple,  simultaneous  creation  of  arobjects  and  processes 

This might be useful to instantiate a replicated arobject simultaneously. However, it
creates more conflict with other optional arguments such as node-id. Thus, the creation
primitive must be invoked once for each arobject or process instance in the current system.

•  Hierarchical  dependency  among  arobjects 

There is no hierarchical dependency among arobjects except for nested arobjects in a
distributed program. Thus, a single arobject can be created and destroyed with no effect
on the rest of the external arobjects in the same distributed program. However, it
should be noted that nested arobjects are killed when the enclosing arobject is killed.

•  Inheritance  relationship  among  arobjects 

There is no inheritance relationship among arobjects. In modern programming
languages, such as Smalltalk-80 [Goldburg 83], Flavor [Weinreb 81], and
LOOPS [Bobrow 81], inheritance relationships among objects have shown great
advantages, improving structural sharing among abstract objects and simplifying the
modification of existing objects. We considered a multiple inheritance relationship
among arobjects, but there were many difficult problems in providing a way to forward a
request message to its super(class) arobject. In our arobject paradigm, we wished to
avoid an unnecessarily (deep) hierarchy among arobjects; thus we did not adopt the
inheritance relationship. However, the current model can support a nested arobject
which is private to the outer arobject.

•  Hierarchical dependency among cooperating processes

There is no hierarchical relationship among cooperating processes. For instance, in
ADA [Ada 83], the termination of a task depends upon the termination of its inner blocks'
tasks. Thus, the termination of a higher-level task may be delayed. In Shoshin [Tokuda
83b], every process must belong to a family tree which indicates such termination
dependencies. A process in Shoshin can be attached to or detached from the creator's
family tree at creation time. This dependency information might be useful during the
debugging phase, but the current system does not support this facility. That is, a process
can kill other processes or terminate itself without causing any additional killing among
communicating processes.




process  accesses  a  private  data  object  from  a  remote  node,  a  conventional  remote 
procedure  call  will  be  used.  If  such  a  data  object  and  the  calling  process  are  located  in 
the  same  node,  then  the  normal  procedure  call  will  be  used  with  identical  semantics  to 
the  RPC  case. 

•  Built-in  timeout  facility  in  communication  primitives 
In  order  to  bound  the  execution  time  of  communication  activity,  we  considered  adding 
one  more  argument,  namely  a  timeout  value,  to  each  communication  primitive.  However, 
we  preferred  to  give  the  user  a  timeout  facility  outside  the  communication  primitives.  It 
should  be  noted  that  each  communication  primitive  may  contain  a  compound  transaction 
so  that  it  is  also  bounded  internally. 

4.4.3.4  Synchronization

There  are  two  levels  of  synchronization  support  in  ArchOS.  One  is  a  critical  region  for  controlling 
shared  private  data  objects  and  the  other  is  explicit  locking  primitives  for  controlling  inter-transaction 
activities. 

There are many problems with the critical region construct. For instance, if a process dies within a
critical  region,  the  state  of  the  shared  objects  cannot  easily  be  restored.  It  is  also  difficult  to  bound 
the  total  waiting  time  as  well  as  the  execution  time  of  the  critical  region. 

Despite these difficulties, a critical region scheme was adopted, since we expect that creating a
compound transaction for such a shared object may be unnecessary in certain situations. For
instance, there are cases in which permanency of data objects is sufficient without providing full
failure atomicity.

The explicit lock and release primitives were necessary to control concurrent transaction activities.
It is clear that by using a discrete lock type, we could encounter a deadlock. In the case of a simple
deadlock, ArchOS might detect and resolve it; however, in general, it will remain the user's
responsibility to detect and resolve such problems.

On the other hand, by using a tree lock protocol a user can avoid the deadlock problem in some
cases. However, a user must declare the dependency among locks in a tree by using the CreateLock
primitive. (See Section 4.4.2.7).
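How a declared lock tree can rule out some deadlocks may be easier to see in code. The following C sketch assumes the classical tree-locking rule (the first lock may be any node; afterwards a lock may be acquired only while its parent is held). The text above does not state the exact ArchOS rule, so this rule, the `lock_tree` structure, and all function names are illustrative assumptions rather than the ArchOS primitives.

```c
#include <stdbool.h>

#define MAX_LOCKS 8
#define NO_PARENT (-1)

/* Hypothetical lock tree owned by a single transaction. */
struct lock_tree {
    int  parent[MAX_LOCKS];   /* parent[i]: lock that lock i depends on */
    bool held[MAX_LOCKS];     /* held[i]: lock i is currently held */
    int  count;               /* number of locks declared so far */
    bool started;             /* true once the first lock was acquired */
};

void tree_init(struct lock_tree *t)
{
    t->count = 0;
    t->started = false;
}

/* Declare a lock and its parent (compare the role of CreateLock). */
int tree_create_lock(struct lock_tree *t, int parent)
{
    if (t->count >= MAX_LOCKS)
        return -1;
    int id = t->count++;
    t->parent[id] = parent;
    t->held[id] = false;
    return id;
}

/* Grant the lock only if the assumed tree rule is respected. */
bool tree_set_lock(struct lock_tree *t, int id)
{
    if (id < 0 || id >= t->count || t->held[id])
        return false;
    if (t->started) {
        int p = t->parent[id];
        if (p == NO_PARENT || !t->held[p])
            return false;     /* parent not held: request refused */
    }
    t->held[id] = true;
    t->started = true;
    return true;
}

void tree_release_lock(struct lock_tree *t, int id)
{
    if (id >= 0 && id < t->count)
        t->held[id] = false;
}
```

Because every transaction descends the same tree, two transactions cannot each hold a lock that the other needs next, which is the cyclic wait that discrete locks permit.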

The current synchronization management does not support or adopt the following functionality:

•  Error recovery in a critical region scheme
We believe that any shared object which requires failure atomicity should be accessed
from a transaction. For instance, we can use a compound transaction to replace the
critical region by using the CT{ ... } construct.




•  Timeout  (exception)  handling  within  a  critical  region 

The  critical  region  construct  simply  takes  a  timeout  parameter  to  bound  the  total 
execution  time  of  the  caller.  This  timeout  mechanism  does  not  provide  a  corresponding 
exception handler; thus, the user must provide that function.

4.4.3.5  Transaction/Recovery Management

A  compound  or  elementary  transaction  construct  creates  a  new  transaction  scope  in  a  client 
process.  Within  this  scope,  a  client  can  access  atomic  objects  as  if  these  computational  steps  were 
executed  alone.  However,  there  are  several  restrictions  on  these  steps  needed  to  maintain  the  basic 
properties  of  transactions.  These  restrictions  are  as  follows: 

•  Between any two transactions, a transaction cannot pass any of its computational state to
the other transaction by using a non-atomic data object. The following Example 1 shows
two illegal transaction scopes in a single process.


Example 1: Computational State Passing via a Non-atomic Data Object

Arobject Server body
{
    process INITIAL(parameters)
    {   ...
        NormalType State1, State2;    /* Non-atomic objects */

        ET1(timeout) {
            State1 = Compute1(arg1, ..., argn);
        }

        ET2(timeout) {
            State2 = Compute2(State1, arg1, ..., argk);
        }
    }
}


In Example 1, the illegal (state) passing occurs within the INITIAL process by using the
non-atomic data object, State1. Since ET2 depends on the value of State1, the
transaction property of ET1 will not be maintained.

•  From two independent transaction scopes, two arobjects cannot communicate with each
other by using a communication primitive or a shared abstract data object. This is an
example of so-called cooperating transactions [Sha 83b] and the current ArchOS model
does not support it.

For instance, the following two examples show normal and cooperating transaction
scopes.


Example 2: A Normal Transaction Scope

Arobject Requestor body
{   ...
    process INITIAL(parameters)
    {   ...
        ET(timeout) {
            Request(Server-Aid, SERVICE, reqmsg, repmsg);
            . . . steps . . .
        }
    }
}

Arobject Server body
{
    process INITIAL(parameters)
    {   ...
        while (true) {
            t-opr-aid = AcceptAny(ANYOPR, reqmsg);
            CT(timeout) {    /* begin compound transaction */
                . . . steps . . .
                do-service(t-opr-aid.opr, reqmsg, replymsg);
            }                /* end transaction */
            Reply(t-opr-aid.tid, replymsg);
        }
    }
}




Example 3: A Cooperating Transaction Scope

Arobject Requestor body
{   ...
    process INITIAL(message)
    {   ...
        ET(timeout) {
            Request(Server-Aid, SERVICE, reqmsg, repmsg);
            . . . steps . . .
        }
    }
}

Arobject Server body
{   ...
    process INITIAL(message)
    {   ...
        while (true) {
            CT(timeout) {    /* begin compound transaction */
                t-opr-aid = AcceptAny(ANYOPR, reqmsg);
                . . . steps . . .
                do-service(t-opr-aid.opr, reqmsg, replymsg);
                Reply(t-opr-aid.tid, replymsg);
            }                /* end transaction */
        }
    }
}


In  Example  2,  it  is  easy  to  determine  the  execution  sequence  of  the  two  transactions. 
That is, ET in the requestor happened before CT in the server. On the other hand,
Example  3  shows  two  concurrent  arobjects,  namely  two  INITIAL  processes, 
communicating  with  each  other  from  within  two  separate  transactions. 

•  If  an  elementary  transaction  contains  a  nested  compound  transaction,  there  is  a 
possibility  of  deadlock  due  to  the  nature  of  the  lock  management.  The  following  example 
shows the possibility of deadlock within a single transaction.


101 


Example 4: A Deadlock within a Single Transaction

Arobject Server body
{   ...
    process Server(message)
    {   ...
        ET(timeout) {    /* begin elementary transaction */
            . . . ET step 1 . . .
            sval = SetLock(DISCRETE, Lx, WRITE);    /* get a lock on object X */
            . . . operate on X . . .

            CT(timeout) {    /* begin compound transaction */
                . . . CT steps . . .
                /* get a lock on object X */
                sval = SetLock(DISCRETE, Lx, WRITE);
                << . . . operate on X . . . >>    /* deadlock! */
            }    /* end transaction */

            . . . ET step 3 . . .
        }    /* end transaction */
    }
}


The  problem  occurs  when  the  nested  compound  transaction  tries  to  obtain  a  lock  on 
object  X.  Since  the  lock  was  already  taken  by  the  top  level  transaction  ET,  this  compound 
transaction  will  abort  due  to  timeout. 
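The failure in Example 4 can be reproduced with a small model. The C sketch below assumes that a discrete lock records only whether it is held, with no check for whether the requester is nested inside the holder, and that a blocked request simply reports a timeout. The types and names (`discrete_lock`, `try_set_lock`, `run_nested_example`) are hypothetical and do not describe the actual ArchOS lock manager.

```c
#include <stdbool.h>

enum lock_result { LOCK_GRANTED, LOCK_TIMED_OUT };

/* Hypothetical discrete lock: remembers only its current holder. */
struct discrete_lock {
    bool held;
    int  owner_tid;
};

/* A request on a held lock cannot succeed, even when the requester is
   nested inside the holder; waiting is modelled as an immediate timeout. */
enum lock_result try_set_lock(struct discrete_lock *lock, int tid)
{
    if (lock->held)
        return LOCK_TIMED_OUT;
    lock->held = true;
    lock->owner_tid = tid;
    return LOCK_GRANTED;
}

/* Replays Example 4: the outer ET locks X, then the nested CT asks for
   the same lock. Returns true if the nested transaction could proceed. */
bool run_nested_example(void)
{
    struct discrete_lock lx = { false, 0 };
    const int et_tid = 1;   /* enclosing elementary transaction */
    const int ct_tid = 2;   /* nested compound transaction */

    if (try_set_lock(&lx, et_tid) != LOCK_GRANTED)
        return false;
    return try_set_lock(&lx, ct_tid) == LOCK_GRANTED;
}
```

In this model `run_nested_example` reports that the nested transaction cannot proceed, matching the abort by timeout described above; avoiding it requires the application to release or restructure the lock before entering the nested scope.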

The  current  transaction/recovery  management  facility  does  not  support  or  adopt  the  following 
functionality: 

•  Cooperating  transactions 

Cooperating  transactions  are  not  supported  as  explained  above. 

•  Detection  of  deadlock  in  the  lock  management 

Complete detection in the locking facility is not provided. A detection mechanism, in
general, was left for the application designer. However, the proper use of tree locks may
help the client to avoid deadlock.

•  Automatic generation of a compensate action for a compound transaction
It is very complicated to automatically generate a compensate action for each operation
involved in a compound transaction. Thus, this was left for the application designer.
ArchOS supports only the automatic execution of well defined compensate actions when
a compound transaction is aborted.

•  Automatic lock management for a shared object
To determine the proper set of locks and their locking rules from the program is not an
easy task, and often the amount of sharing obtained is limited due to simple-minded
locking rules. The lock management is also left for the application designer.




4.4.3.6  File Management

The current file management in ArchOS supports a single level file structure and does not provide
any directory structure for users. However, the system can provide at least three kinds of file
properties. First, a normal file has the data portion of the file arobject in volatile storage. Second, a
permanent file's data portion is allocated in permanent (non-volatile) storage. Finally, an atomic file
has a data portion which is declared as an atomic data object.

Since there is no notion of file protection at this level, a protection domain can be built based on the
scope of the reference name.

The current file management does not support or adopt the following functionality. (Note that the
following features are not permanently removed from ArchOS. These functionalities may be easily
adopted on top of the current file arobject's structure.)

•  Directory  structure 

There is no directory structure in the current file system. A client must use a flat file name
space  to  classify  files. 

•  A replicated atomic file type

This replicated file type must vary in terms of its access protocols and replication
schemes. Thus, there is no system supported replicated file type so far.

•  File sharing and protection scheme

There is no conventional access list for a file object; rather, each file will have a set of
locks to control concurrent file accesses.

It should be noted that if a normal or permanent file is accessed from a transaction scope, not all of the
transaction properties will be provided. The file must be an atomic file type in order to fully utilize
the transaction facility.

4.4.3.7  I/O Device Management

The I/O device management provides a set of access primitives for normal or special devices in the
system. These primitives are similar to the ones which are used for file management, but ArchOS
does not provide complete compatibility between the two.

The current I/O device management does not support or adopt the following functionality.

•  Full compatibility between devices and files

The main reason was that it is difficult to cover all different, specialized devices as
standard (i.e., atomic, permanent or normal) files.

•  Transaction facilities are not supported on the real devices




Unlike arobjects, a user cannot create a transaction for a sequence of device operations.
This is because "undo"s are often impossible following real actions taken on these
devices.

4.4.3.8  Time Management

The time management primitives provide functions for obtaining time information and/or
setting/resetting  scheduling  parameters  for  performing  time-driven  scheduling.  This  time-driven 
scheduling  is  based  on  ArchOS’  best  effort  scheduling  model.  Since  this  model  requires  at  least 
request,  delay,  deadline,  and  estimated  execution  times  as  basic  parameters,  the  Delay  and  Alarm 
primitives  were  designed  to  provide  these  parameters  from  a  client  process. 

There  are  several  approaches  to  the  provision  of  these  functions  in  a  real  time  operating  system,  but 
these  were  chosen  in  an  attempt  to  provide  maintainability  and  modularity  as  well  as  time  control.  For 
example, many systems allow a time-out to be set, at the expiration of which an application
process will be scheduled. This frequently requires breaking a module at the delay point into two
processes, rendering the processing difficult to read and understand. The Delay and Alarm primitives
provide  for  such  delays  to  be  coded  inline  in  the  application  without  breaking  the  process  into 
multiple  processes. 

At the same time, the deadline-driven scheduling algorithm to be used in ArchOS is provided with
the time parameters needed to apply deadline policies. A part of the research being conducted on
the Archons project concerns process scheduling in the presence of application-defined deadlines. It
is intended that application-defined policies will be used to control deadline-driven scheduling which
will attempt to meet deadlines or minimize the damage if insufficient resources are available to meet
deadlines.


The  current  time  management  does  not  support  or  adopt  the  following  functionality: 

•  Asynchronous Delay or Alarm primitive
Since we would like to avoid interrupt-driven timeout routines, the Delay and Alarm
primitives are blocking primitives. This reduces the complexity required in the structure
of a process body. To provide this type of asynchronous handling, the user may create a
new (time) process to avoid the blocking.


4.4.3.9  Policy Management

The policy management primitives provide functions which implement the basic structure of our
policy module. The policy set arobject maintains a set of policy modules in a distributed program. A
policy module represents a real policy and is maintained by ArchOS or the policy set arobject.




The  current  policy  management  does  not  support  or  adopt  the  following  functionality: 

•  Dynamic  addition/deletion  of  a  new  user-defined  policy  module 
The current model can support only a predefined set of user policies; thus a new user-defined
policy cannot be recognized by ArchOS.

4.4.3.10  System  Monitoring  and  Debugging  Support 
The  system  monitoring  and  debugging  facilities  provide  various  abilities  to  control  the  behavior  of 
cooperating  arobjects  and  their  processes  during  execution.  All  of  the  primitives  for  system 
monitoring  and  debugging  can  only  be  used  by  a  specially  privileged  arobject,  so  that  a  normal  client 
arobject  cannot  issue  any  of  them. 


The  current  system  monitoring  and  debugging  facilities  do  not  support  or  adopt  the  following 
functionality:

•  Single-step  function  for  tracing  process  activity 

A  single  step  function  is  not  supported  yet,  but  may  be  added  in  the  future. 

•  Creating  an  arbitrary  break  point  in  an  arobject 

A  client  cannot  set  an  arbitrary  break  point  in  an  arobject.  The  available  breaking  points 
are  only  the  communication  points  where  a  process  sends  or  receives  a  message.  A 
monitoring  process  may  trap  the  following  execution  step  by  using  the  CaptureComm  or 
WatchComm primitives.

•  Specialized debug functions such as Redo or Undo operation for arbitrary
functions

These  facilities  should  be  built  by  using  the  current  facilities,  rather  than  creating  a  set  of 
new  primitives. 




A.  Program  Examples 

This  appendix  presents  an  example  problem  involving  a  distributed  data  object  along  with  two 
potential  solutions  employing  arobjects  and  the  ArchOS  system  primitives. 

A.1  The  Problem:  A  Distributed  Directory  with  Partially 
Replicated  Data 

In  this  example,  the  problem  is  to  implement  a  distributed  directory,  where  the  directory  data  is 
physically  spread  among  several  processing  nodes  with  each  directory  entry  replicated  at  multiple 
nodes  in  order  to  improve  reliability.  Notice  that  there  is  no  guarantee  that  every  node  has  a  complete 
copy  of  the  directory  information. 

In  particular,  the  problem  is  to  construct  a  distributed  directory  of  some  fixed  length,  where  each 
entry  in  the  directory  associates  a  name  (a  string  of  characters)  with  a  value  (in  the  examples,  the 
values  used  are  assumed  to  be  integers).  The  data  in  the  directory  must  be  partially  replicated  at 
several  different  processing  nodes,  and  directory  users  must  always  have  a  consistent  view  of  the 
directory. 

This type of distributed directory represents a class of data object that will be common in an ArchOS
client's distributed program: an object that is highly robust (due to the presence of multiple copies of
each  critical  data  item  at  physically  separate  nodes)  but  does  not  require  that  the  user  of  this  data 
object  be  aware  of  the  physical  distribution  involved  in  the  implementation.  (In  fact,  this 
implementation  is  invoked  by  the  user  in  exactly  the  same  manner  as  a  centralized  data  object.) 

A.2  The General Approach to the Problem

The solutions presented in the next two sections are both built upon the same fundamental
approach. The directory data is contained in several different arobjects (called directory-copy
arobjects), located at different processing nodes. These arobjects each contain some portion of the
total directory state, but it is unlikely that any two of them contain exactly the same data or that any
one of them contains all of the current directory data.

Another entity is provided to consistently access and maintain the directory-copy arobjects so that
the directory user "sees" a single directory object (called the directory arobject). This entity takes a
different form in the two solutions that follow, and this is their major distinguishing feature. In the first
example, one or more directory arobjects accept directory operation invocations and then service
them  by  making  the  appropriate  invocations  on  the  directory-copy  arobjects.  All  of  the  arobjects  in 
this  solution  are  separate  entities,  and  so  this  solution  views  the  organization  of  arobjects  as  being 
quite  "flat." 

In  the  second  solution,  the  entire  directory  service  is  provided  by  a  single  arobject.  This  arobject  is 
physically  distributed  over  several  nodes  and  has  placed  multiple  processes  throughout  the  system  to 
accept  directory  operation  invocations.  In  addition,  the  directory-copy  arobjects  are  embedded 
within  the  directory  arobject  as  private  arobjects  in  the  second  solution.  As  a  result,  this  solution 
views  arobjects  as  being  organized  in  a  hierarchical  manner. 

ArchOS  supports  both  views  of  the  organization  of  arobjects  (flat  or  hierarchical).  Neither  of  them 
appears  to  be  better  than  the  other  for  this  example,  but  the  client  is  free  to  organize  arobjects  in  the 
manner  that  seems  most  advantageous  for  the  application  at  hand. 

Both solutions manage the coordination of the multiple directory copies in the same way: by
applying the notion of using only a quorum [Gifford 79, Herlihy 84] of directory copies in order to carry
out any operation on the distributed directory. Using this method allows the directory to continue
operating smoothly even if several directory copies are unavailable at any given time. Each directory
operation (lookup, add, delete) requires only that a quorum of directory copies be involved to perform
that operation. In order to determine the latest value of a given name, the directory-copy arobjects also
maintain a version number with each entry name. (The current value of a given name corresponds to
the entry for that name with the highest version number in any directory copy.)
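The two ingredients of this scheme, overlapping quorums and version numbers, can be sketched in a few lines of C. The sketch assumes a hypothetical reply record holding a value and a version; `latest_reply` selects the reply with the highest version, and `quorums_intersect` checks the usual condition that a read quorum and a write quorum together exceed the number of copies. The names are illustrative and are not part of the ArchOS examples that follow.

```c
#include <stddef.h>

/* Hypothetical reply from one directory copy for a given name. */
struct dir_reply {
    int value;     /* value stored for the name */
    int version;   /* version number kept with the entry */
};

/* Returns the index of the reply carrying the highest version number,
   or -1 if no replies were collected. */
int latest_reply(const struct dir_reply *replies, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (best < 0 || replies[i].version > replies[best].version)
            best = (int)i;
    }
    return best;
}

/* A read quorum is guaranteed to see the latest write only if every
   read quorum shares at least one copy with every write quorum. */
int quorums_intersect(int copies, int read_quorum, int write_quorum)
{
    return read_quorum + write_quorum > copies;
}
```

With the quorum sizes used in Solution 1 (N+1 readers and N+1 writers out of 2N+1 copies), the sum is 2N+2, so any read quorum contains at least one copy that took part in the most recent write.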

Also, notice that the transaction facility has been used in order to guarantee the consistency of the
directory information at all times. Elementary transactions have been used for the most part since the
directory operations invoked by a user of the directory arobject may be part of a larger transaction,
and in that case, if compound transactions had been used, the consistency of the directory could no
longer be guaranteed.

A.3  Solution 1: Two Cooperating Classes of Arobjects

As outlined in the previous section, this solution uses several separate directory and directory-copy
arobjects distributed throughout the system. The code for these arobject types is assumed to be
contained in two files: "directory.arb" contains the code for the directory arobject and "dir-copy.arb"
contains the code for the directory-copy arobject.




/* This code is contained in the file "directory.arb" */

#define DIR-COPIES    2*N+1
#define READ-QUORUM   N+1
#define WRITE-QUORUM  N+1
#define DEL-QUORUM    2*N+1

/* If a DEL-QUORUM cannot be formed, NULL can be written to the value
   field of a new version of the desired name. Later, when all
   copies are available, a delete can again be attempted. */

arobject directory specification
{
    char    *name;
    int     value;
    STATUS  result;

    operation add(name, value) --> (result);
    operation delete(name) --> (result);
    operation find(name) --> (result, value);
}

arobject directory body
{
    ***** insert message type declarations here *****

    /* Private Abstract Data Type Definitions */
    private-abstract-data-type DCNAME = {
        permanent AROBJ-REFNAME dir-copy-name;

        /* Procedures to Operate on Atomic Data Items */
        procedure store(dcname)
        AROBJ-REFNAME dcname;
        {
            dir-copy-name = dcname;
        }

        AROBJ-REFNAME function get()
        {
            return(dir-copy-name);
        }
    } /* end of private-abstract-data-type DCNAME */

    DCNAME dcname;

    process INITIAL(dircopyname)
    AROBJ-REFNAME dircopyname;
    {
        MESSAGE    *requestmsg;
        OPERATION  opr;



        dcname.store(dircopyname);
        while (TRUE) {
            AcceptAny(ANYOPR, requestmsg);
            opr = msg-header-operation(requestmsg);
            CreateProcess(opr, requestmsg);
        }
    }

    process add(add-requestmsg)
    {
        char            *name;
        int             value;
        int             i, max-version;
        STATUS          rd-result, wr-result;
        int             val[READ-QUORUM];
        int             version[READ-QUORUM];
        TIME            timeout = TIMEOUT;
        TRANSACTION-ID  tid, trans-id;
        PID             pid;
        struct add-replymsg {
            STATUS result;
        }

        strcpy(name, add-requestmsg.body.name);
        value = add-requestmsg.body.value;
        pid = msg-header-caller(add-requestmsg);
        tid = msg-header-tid(add-requestmsg);

        ET(timeout) {
            trans-id = SelfTid();
            read-quorum(name, val, version, rd-result);
            if (rd-result != OK) {
                Reply(pid, tid, {FAIL});
                AbortTransaction(trans-id);
            }
            /* the new entry gets a version one higher than any seen */
            max-version = -1;
            for (i=1; i<=READ-QUORUM; i++)
                max-version = max(max-version, version[i]);
            write-quorum(name, value, max-version+1, wr-result);
            add-replymsg.body.result = wr-result;
            Reply(pid, tid, add-replymsg);
        }
        if (IsAborted(trans-id)) {
            /* handle error condition here */
        }
    }

    procedure read-quorum(name, value, version, result)
    char    *name;
    int     value[READ-QUORUM];



    int     version[READ-QUORUM];
    STATUS  result;
    {
        int             count;
        TRANSACTION-ID  tid;
        FIND-REPLYMSG   find-replymsg;

        count = 0;
        /* Initiate timeout */
        tid = RequestAll(dcname.get(), find, {name});
        while (count < READ-QUORUM) {
            GetReply(tid, find-replymsg);
            if (find-replymsg.body.result == OK) {
                count++;
                value[count] = find-replymsg.body.value;
                version[count] = find-replymsg.body.version;
            }
        }
        /* Need to fix case where quorum cannot be established. */
        AbortIncompleteTransaction(tid);
        result = OK;
    }

    procedure write-quorum(name, value, version, result)
    char    *name;
    int     value, version;
    STATUS  result;
    {
        int             count;
        TRANSACTION-ID  tid;
        struct add-replymsg ...

        count = 0;
        /* Initiate timeout */
        tid = RequestAll(dcname.get(), add, {name, value, version});
        if (add-replymsg.body.result == OK) count++;
        while (count < WRITE-QUORUM) {
            GetReply(tid, add-replymsg);
            if (add-replymsg.body.result == OK) count++;
        }
        /* Need to fix case where quorum cannot be found. */
        AbortIncompleteTransaction(tid);
        result = OK;
    }
}
/* End of directory arobject body */




/* This code is contained in the file "dir-copy.arb" */

arobject directory-copy specification
{
    char    name[MAXSIZE];
    int     value;
    int     version-no;
    STATUS  result;

    operation add(name, value, version-no) --> (result);
    operation delete(name) --> (result);
    operation find(name) --> (result, value, version-no);
}

arobject directory-copy body
{
    /* Private Abstract Data Type Definitions */
    private-abstract-data-type TABLE = {
        TREE-LOCK-ID t-lock[TABLESIZE];
        /* define tree lock structure as a "linear" tree, such that
           t-lock[1] is the root and has child t-lock[2];
           t-lock[2] has child t-lock[3];
           and so on. */

        atomic struct table[TABLESIZE] {
            char  *name;
            int   value;
            int   version;
        }

        /* Procedures to Operate on Atomic Data Items */
        procedure init-table()
        {
            int             i;
            TIME            timeout = TIMEOUT;
            TRANSACTION-ID  tid;

            /* define tree lock structure */
            t-lock[1] = CreateLock(NULL-LOCK-ID);
            for (i=2; i<=TABLESIZE; i++) {
                t-lock[i] = CreateLock(t-lock[i-1]);
            }

            CT(timeout) {
                tid = SelfTid();
                for (i=1; i<=TABLESIZE; i++) {
                    SetLock(TREE-LOCK, t-lock[i], WRITE);
                    table[i].name = NULL;
                    ReleaseLock(TREE-LOCK, t-lock[i], WRITE);
                }
            }



            if (IsAborted(tid)) {
                /* handle error condition */
            }
        }

        procedure lookup(name, index)
        char  *name;
        {
            int   index, i;
            TIME  timeout = TIMEOUT;

            index = -1;
            ET(timeout) {
                for (i=1; i<=TABLESIZE; i++) {
                    SetLock(TREE-LOCK, t-lock[i], READ);
                    if (strcmp(table[i].name, name) == 0) {
                        index = i;
                        break;
                    }
                    ReleaseLock(TREE-LOCK, t-lock[i], READ);
                }
            }
        }

        STATUS function enter(index, name, value, version-no)
        char  *name;
        int   index, value, version-no;
        {
            TIME            timeout = TIMEOUT;
            TRANSACTION-ID  tid;

            ET(timeout) {
                tid = SelfTid();
                SetLock(TREE-LOCK, t-lock[index], WRITE);
                /* abort if the slot is occupied by a different name */
                if (table[index].name != NULL && strcmp(table[index].name, name) != 0)
                    AbortTransaction(tid);
                strcpy(table[index].name, name);
                table[index].value = value;
                table[index].version = version-no;
            }
            if (IsCommitted(tid)) return(OK);
            else return(FAIL);
        }

        procedure get-fields(index, name, value, version-no, result)
        char    *name;
        int     index, value, version-no;
        STATUS  result = FAIL;
        {
            TIME            timeout = TIMEOUT;
            TRANSACTION-ID  tid;

            ET(timeout) {

112 


t1d  ■  SelfTidO; 

SetLock(TREE-LOCK,  t'lock[1ndex],  READ); 
strcpy(name,  tab1e[1ndex].name) ; 
value  ■  tab1e[1ndex]. value; 
vers1on-no  ■  tab1e[1ndex] .version; 

> 

if  (IsCoinmltted(tld))  result  ■  OK; 

> 

}  /•  end  of  private-abstract-data-type  TABLE  •/ 

    TABLE tab;

    process INITIAL()
    {
        OPERATION opr;

        tab.init-table();
        while (TRUE) {
            AcceptAny(ANYOPR, requestmsg);
            opr = msg-header-operation(requestmsg);
            CreateProcess(opr, requestmsg);
        }
    }

    process add(add-requestmsg)
    ADD-REQUESTMSG add-requestmsg;
    {
        char *name;
        int value, version-no;
        int index;
        STATUS stat, enter();
        TIME timeout = TIMEOUT;
        struct add-replymsg {
            STATUS result = FAIL;
        }

        strcpy(name, add-requestmsg.body.name);
        value = add-requestmsg.body.value;
        version-no = add-requestmsg.body.version;

        ET(timeout) {
            index = lookup(name);
            if (index < 0) index = lookup(NULL);
            if (index < 0) AbortTransaction(SelfTid());
            stat = enter(index, name, value, version-no);
            if (stat != OK) AbortTransaction(SelfTid());
            add-replymsg.body.result = OK;
        }
        Reply(msg-header-caller(add-requestmsg),
              msg-header-tid(add-requestmsg), add-replymsg);
    }




    process find(find-requestmsg)
    FIND-REQUESTMSG find-requestmsg;
    {
        char *name;
        int index, value, version-no;
        char *entry-name;
        STATUS stat;
        struct find-replymsg ...;
        PID pid;
        TRANSACTION-ID tid;

        strcpy(name, find-requestmsg.body.name);
        pid = msg-header-caller(find-requestmsg);
        tid = msg-header-tid(find-requestmsg);

        index = lookup(name);
        if (index < 0) Reply(pid, tid, {NOT-FOUND});
        else {
            get-fields(index, entry-name, value, version-no, stat);
            if (strcmp(name, entry-name) == NULL)
                Reply(pid, tid, {OK, value, version-no});
            else Reply(pid, tid, {FAIL});
        }
    }

} /* End of directory-copy arobject body */




A.4 Solution 2: A Single, Distributed Arobject

As  previously  discussed,  this  solution  implements  the  entire  directory  server  as  a  single  arobject  in 
which  several  directory-copy  arobjects  have  been  included  as  private  arobjects.  As  in  the  previous 
solution,  the  code  for  the  directory  arobject  is  located  in  "directory.arb,"  while  the  code  for  the 
directory-copy arobject is located in "dir-copy.arb." In fact, the code for the directory-copy arobject is
exactly the same as in the first solution; however, the code for the directory arobject has been
changed somewhat. The greatest change has taken place in the INITIAL process of the directory
arobject;  also,  the  directory  arobject  now  contains  a  statement  that  identifies  the  description  of  the 
directory-copy  arobject  as  a  private  arobject. 




/* This code is contained in the file "directory.arb" */

#define DIR-COPIES    2*N+1
#define READ-QUORUM   N+1
#define WRITE-QUORUM  N+1
#define DEL-QUORUM    2*N+1

/* If a DEL-QUORUM cannot be formed, NULL can be written to the value
   field of a new version of the desired name. Later, when all
   copies are available, a delete can again be attempted. */

arobject directory specification
{
    char *name;
    int value;
    STATUS result;

    operation add(name, value) --> (result);
    operation delete(name) --> (result);
    operation find(name) --> (result, value);
}

arobject directory body
{
    ***** Insert message type declarations here *****

    /* Private Abstract Data Type Definitions */
    private-abstract-data-type DCNAME = {
        permanent AROBJ-REFNAME dir-copy-name;

        /* Procedures to Operate on Atomic Data Items */

        procedure store(dcname)
        AROBJ-REFNAME dcname;
        {
            dir-copy-name = dcname;
        }

        AROBJ-REFNAME function get()
        {
            return(dir-copy-name);
        }

    } /* end of private-abstract-data-type DCNAME */

    DCNAME dcname;

    private-arobject directory-copy = "dir-copy.arb";
    process INITIAL()
    {
        NODE node[TOTAL-NODES] = {nodea, ...
        int i;
        AID aid;

        for (i=1; i<=COPIES; i++) {
            aid = CreateArobject(directory-copy, NULL, node[i]);
            BindArobjectName(aid, "dir-copy");
        }

        dcname.store("dir-copy");
        for (i=1; i<=SERVERS; i++)
            CreateProcess(accept-invocs, NULL, node[i]);
    }

    process accept-invocs()
    {
        MESSAGE *requestmsg;
        OPERATION opr;

        while (TRUE) {
            AcceptAny(ANYOPR, requestmsg);
            opr = msg-header-operation(requestmsg);
            CreateProcess(opr, requestmsg);
        }
    }

    process add(add-requestmsg)
    {
        char *name;
        int value;
        int i;
        STATUS rd-result;
        int val[READ-QUORUM];
        int version[READ-QUORUM];
        TIME timeout = TIMEOUT;
        TRANSACTION-ID tid, trans-id;
        PID pid;
        struct add-replymsg {
            STATUS result;
        }

        strcpy(name, requestmsg.body.name);
        value = requestmsg.body.value;

        pid = msg-header-caller(add-requestmsg);
        tid = msg-header-tid(add-requestmsg);

        ET(timeout) {
            trans-id = SelfTid();
            read-quorum(name, val, version, rd-result);
            if (rd-result != OK) {
                Reply(pid, tid, {FAIL});
                AbortTransaction(trans-id);
            }
            max-version = -1;
            for (i=1; i<=READ-QUORUM; i++)
                max-version = max(max-version, version[i]);
            write-quorum(name, value, max-version+1, wr-result);
            add-replymsg.body.result = wr-result;
            Reply(pid, tid, add-replymsg);
        }

        if (IsAborted(trans-id)) {
            /* handle error condition here */
        }
    }

    procedure read-quorum(name, value, version, result)
    char *name;
    int value[READ-QUORUM];
    int version[READ-QUORUM];
    STATUS result;
    {
        int count;
        TRANSACTION-ID tid;
        FIND-REPLYMSG find-replymsg;

        count = 0;
        /* initiate timeout */
        tid = RequestAll(dcname.get(), find, (name));
        while (count < READ-QUORUM) {
            GetReply(tid, find-replymsg);
            if (find-replymsg.body.result == OK) {
                count++;
                value[count] = find-replymsg.body.value;
                version[count] = find-replymsg.body.version;
            }
        }

        /* Need to fix case where quorum cannot be established. */
        AbortIncompleteTransaction(tid);
        result = OK;
    }

    procedure write-quorum(name, value, version, result)
    char *name;
    int value, version;
    STATUS result;
    {
        int count;
        TRANSACTION-ID tid;
        struct add-replymsg ...

        count = 0;
        /* initiate timeout */
        tid = RequestAll(dcname.get(), add, {name, value, version});
        if (add-replymsg.body.result == OK) count++;
        while (count < WRITE-QUORUM) {
            GetReply(tid, add-replymsg);
            if (add-replymsg.body.result == OK) count++;
        }

        /* Need to fix case where quorum cannot be found. */
        AbortIncompleteTransaction(tid);
        result = OK;
    }

} /* End of directory arobject body */
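
The version-numbered quorum scheme used by the add operation above can be illustrated outside the arobject notation. The following Python sketch is only an approximation under stated assumptions: the class names, the `up` flag used to simulate an unavailable copy, and the treatment of a "name not found" reply as a valid quorum reply with version -1 are all hypothetical and are not ArchOS primitives. It keeps the listing's parameters (2N+1 copies, read and write quorums of N+1), so any read quorum overlaps any write quorum and the highest version seen in a read quorum is the most recently written one.

```python
import random

N = 1
DIR_COPIES = 2 * N + 1      # total number of directory copies
READ_QUORUM = N + 1         # copies consulted by a read
WRITE_QUORUM = N + 1        # copies updated by a write


class DirectoryCopy:
    """One replica: maps name -> (value, version). Hypothetical stand-in."""

    def __init__(self):
        self.table = {}
        self.up = True          # simulated availability of this copy

    def find(self, name):
        # Returns (value, version), or None if the name is unknown here.
        return self.table.get(name)

    def add(self, name, value, version):
        self.table[name] = (value, version)


class Directory:
    """Front end that reads and writes quorums of copies."""

    def __init__(self, copies=DIR_COPIES):
        self.copies = [DirectoryCopy() for _ in range(copies)]

    def _available(self):
        return [c for c in self.copies if c.up]

    def read_quorum(self, name):
        # Replies from READ_QUORUM live copies, or None if no quorum exists.
        live = self._available()
        if len(live) < READ_QUORUM:
            return None
        return [c.find(name) for c in random.sample(live, READ_QUORUM)]

    def add(self, name, value):
        replies = self.read_quorum(name)
        if replies is None:
            return False
        # Highest version seen in the read quorum; -1 if the name is new.
        max_version = max((r[1] for r in replies if r is not None), default=-1)
        live = self._available()
        if len(live) < WRITE_QUORUM:
            return False
        for copy in random.sample(live, WRITE_QUORUM):
            copy.add(name, value, max_version + 1)
        return True

    def find(self, name):
        replies = self.read_quorum(name)
        if replies is None:
            return None
        found = [r for r in replies if r is not None]
        if not found:
            return None
        # The reply with the highest version carries the current value.
        return max(found, key=lambda r: r[1])[0]
```

With N = 1, the directory keeps working when one of the three copies is unavailable and refuses both reads and writes when two are, which is the behavior the quorum sizes are chosen to give.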




/* This code is contained in the file "dir-copy.arb" */

arobject directory-copy specification
{
    char name[MAXSIZE];
    int value;
    int version-no;
    STATUS result;

    operation add(name, value, version-no) --> (result);
    operation delete(name) --> (result);
    operation find(name) --> (result, value, version-no);
}

arobject directory-copy body
{
    /* Private Abstract Data Type Definitions */
    private-abstract-data-type TABLE = {

        TREE-LOCK-ID t-lock[TABLESIZE];

        /* define tree lock structure as a "linear" tree, such that
           t-lock[1] is the root and has child t-lock[2];
           t-lock[2] has child t-lock[3];
           and so on. */

        atomic struct table[TABLESIZE] {
            char *name;
            int value;
            int version;
        }


        /* Procedures to Operate on Atomic Data Items */

        procedure init-table()
        {
            int i;
            TIME timeout = TIMEOUT;
            TRANSACTION-ID tid;

            /* define tree lock structure */
            t-lock[1] = CreateLock(NULL-LOCK-ID);
            for (i=2; i<=TABLESIZE; i++) {
                t-lock[i] = CreateLock(t-lock[i-1]);
            }

            CT(timeout) {
                tid = SelfTid();
                for (i=1; i<=TABLESIZE; i++) {
                    SetLock(TREE-LOCK, t-lock[i], WRITE);
                    table[i].name = NULL;
                    ReleaseLock(TREE-LOCK, t-lock[i], WRITE);
                }
            }

            if (IsAborted(tid)) {
                /* handle error condition */
            }
        }

        procedure lookup(name, index)
        char *name;
        {
            int index, i;
            TIME timeout = TIMEOUT;

            index = -1;
            ET(timeout) {
                for (i=1; i<=TABLESIZE; i++) {
                    SetLock(TREE-LOCK, t-lock[i], READ);
                    if (strcmp(table[i].name, name) == NULL) {
                        index = i;
                        breakfor;
                    }
                    ReleaseLock(TREE-LOCK, t-lock[i], READ);
                }
            }
        }

        STATUS function enter(index, name, value, version-no)
        char *name;
        int index, value, version-no;
        {
            TIME timeout = TIMEOUT;
            TRANSACTION-ID tid;

            ET(timeout) {
                tid = SelfTid();
                SetLock(TREE-LOCK, t-lock[index], WRITE);
                if (table[index].name != NULL || table[index].name != name)
                    AbortTransaction(tid);
                strcpy(table[index].name, name);
                table[index].value = value;
                table[index].version = version-no;
            }
            if (IsCommitted(tid)) return(OK);
            else return(FAIL);
        }

        procedure get-fields(index, name, value, version-no, result)
        char *name;
        int index, value, version-no;
        STATUS result = FAIL;
        {
            TIME timeout = TIMEOUT;
            TRANSACTION-ID tid;

            ET(timeout) {
                tid = SelfTid();
                SetLock(TREE-LOCK, t-lock[index], READ);
                strcpy(name, table[index].name);
                value = table[index].value;
                version-no = table[index].version;
            }
            if (IsCommitted(tid)) result = OK;
        }

    } /* end of private-abstract-data-type TABLE */


    TABLE tab;

    process INITIAL()
    {
        OPERATION opr;

        tab.init-table();
        while (TRUE) {
            AcceptAny(ANYOPR, requestmsg);
            opr = msg-header-operation(requestmsg);
            CreateProcess(opr, requestmsg);
        }
    }

    process add(add-requestmsg)
    ADD-REQUESTMSG add-requestmsg;
    {
        char *name;
        int value, version-no;
        int index;
        STATUS stat, enter();
        TIME timeout = TIMEOUT;
        struct add-replymsg {
            STATUS result = FAIL;
        }

        strcpy(name, add-requestmsg.body.name);
        value = add-requestmsg.body.value;
        version-no = add-requestmsg.body.version;

        ET(timeout) {
            index = lookup(name);
            if (index < 0) index = lookup(NULL);
            if (index < 0) AbortTransaction(SelfTid());
            stat = enter(index, name, value, version-no);
            if (stat != OK) AbortTransaction(SelfTid());
            add-replymsg.body.result = OK;
        }
        Reply(msg-header-caller(add-requestmsg),
              msg-header-tid(add-requestmsg), add-replymsg);
    }




    process find(find-requestmsg)
    FIND-REQUESTMSG find-requestmsg;
    {
        char *name;
        int index, value, version-no;
        char *entry-name;
        STATUS stat;
        struct find-replymsg ...;
        PID pid;
        TRANSACTION-ID tid;

        strcpy(name, find-requestmsg.body.name);
        pid = msg-header-caller(find-requestmsg);
        tid = msg-header-tid(find-requestmsg);

        index = lookup(name);
        if (index < 0) Reply(pid, tid, {NOT-FOUND});
        else {
            get-fields(index, entry-name, value, version-no, stat);
            if (strcmp(name, entry-name) == NULL)
                Reply(pid, tid, {OK, value, version-no});
            else Reply(pid, tid, {FAIL});
        }
    }

} /* End of directory-copy arobject body */




References

[Ada 83] Ada Programming Language.
1983.

[Ahuja 82] Ahuja, Vijay.
Routing.
In Design and Analysis of Computer Communication Networks, chapter 7, pages 233-260. McGraw-Hill, 1982.

[Almes 83] Almes, G. T.; Black, A. P.; Lazowska, E. D.; Noe, J. D.
The Eden System: A Technical Review.
Technical Report 83-10-05, University of Washington, October, 1983.

[Bobrow 81] Bobrow, D. G.; Stefik, M.
The LOOPS Manual.
Technical Report KB-VLSI-81-13, Xerox Palo Alto Research Center, August, 1981.

[DeGroot 74] DeGroot, M.
Reaching a Consensus.
Journal of the American Statistical Association, March, 1974.

[Erman 73] Erman, L. D., Fennell, R. D., Lesser, V. R., and Reddy, D. R.
System Organizations for Speech Understanding.
Proceedings, Third International Joint Conference on Artificial Intelligence, August, 1973.

[Erman 79] Erman, L. D., and V. R. Lesser.
The Hearsay-II System: A Tutorial.
In W. A. Lea (editor), Trends in Speech Recognition, chapter 16. Prentice-Hall, 1979.

[Gifford 79] Gifford, D. K.
Weighted Voting for Replicated Data.
Operating Systems Review 13(5), December, 1979.

[Goldburg 83] Goldburg, A.; Robson, D.
Smalltalk-80: The Language and Its Implementation.
Addison-Wesley, 1983.

[Gray 77] James Gray.
Notes on Data Base Operating Systems.
Technical Report, IBM Research Laboratory, San Jose, 1977.

[Herlihy 84] Herlihy, M. P.
Replication Methods for Abstract Data Types.
PhD thesis, MIT Laboratory for Computer Science, 1984.

[Jensen 81a] Jensen, E. Douglas.
Decentralized Control.
In Distributed Systems - Architecture and Implementation: An Advanced Course, chapter 8. Springer-Verlag, 1981.




[Jensen 81b] Jensen, E. Douglas.
Decentralized Resource Management.
Course Notes, C.E.A./I.N.R.I.A/E.D.F Ecole d'ete d'Informatique, Breau-sans-Nappe, France, July, 1981.

[Jensen 84] Jensen, E. D. and Pleszkoch, N.
ArchOS: A Physically Dispersed Operating System -- An Overview of its Objectives and Approach.
IEEE Distributed Processing Technical Committee Newsletter, Special Issue on Distributed Operating Systems, June, 1984.

[Kernighan 78] Brian W. Kernighan, Dennis M. Ritchie.
The C Programming Language.
Prentice-Hall, 1978.

[Lazowska 81] Lazowska, E. D., Levy, H. M., Almes, G. T., Fischer, M. J., Fowler, R. J., and Vestal, S. C.
The Architecture of the Eden System.
Operating Systems Review 15(5):148-159, December, 1981.

[Lynch 83] Lynch, Nancy A.
Concurrency Control for Resilient Nested Transactions.
In Proceedings, Second SIGACT-SIGMOD Conference on the Principles of Database Systems. ACM, 1983.

[Marschak 72] Marschak, J., and Radner, R.
Economic Theory of Teams.
Yale University Press, 1972.

[Moss 81] Moss, J. E. B.
Nested Transactions: An Approach to Reliable Distributed Computing.
PhD thesis, Massachusetts Institute of Technology, April, 1981.

[Rashid 81] Rashid, Richard F. and George G. Robertson.
Accent: A Communication Oriented Network Operating System Kernel.
Technical Report CMU-CS-81-123, Department of Computer Science, Carnegie-Mellon University, April, 1981.

[Schwarz 82] Schwarz, Peter M. and Alfred Z. Spector.
Synchronizing Shared Abstract Types.
Technical Report CMU-CS-82-128, Department of Computer Science, Carnegie-Mellon University, 1982.

[Sha 83a] Sha, Lui, E. Douglas Jensen, Richard F. Rashid, and J. Duane Northcutt.
Distributed Cooperating Processes and Transactions.
In Synchronization, Control, and Communication in Distributed Computing Systems. Academic Press, 1983.

[Sha 83b] Sha, L., Jensen, E. D., Rashid, R. F., and Northcutt, J. D.
Distributed Co-operating Processes and Transactions.
Proceedings of ACM SIGCOMM symposium, 1983.




[Sha 84] Sha, Lui.
Synchronization in Distributed Operating Systems.
PhD thesis, Department of Electrical Engineering, Carnegie-Mellon University, 1984. In preparation.

[Silberschatz 80] Silberschatz, A., and Z. Kedem.
Consistency in Hierarchical Database Systems.
Journal of the Association of Computing Machinery 27(1), January, 1980.

[Spice 79] Department of Computer Science.
Proposal for A Joint Effort in Personal Computing.
Technical Report, Carnegie-Mellon University, 1979.

[Tokuda 83a] Tokuda, H.; Manning, E. G.
An Interprocess Communication Model for a Distributed Software Testbed.
Proceedings of ACM SIGCOMM symposium, 1983.

[Tokuda 83b] Tokuda, H., Radia, S. R., and Manning, E. G.
Shoshin OS: a Message-based Distributed Operating System for a Distributed Software Testbed.
Proceedings of the 16th Hawaii Int. Conf. on System Sciences, 1983.

[Weinreb 81] Weinreb, D.; Moon, D.
Lisp Machine Manual.
Technical Report, MIT Artificial Intelligence Laboratory, 1981.

[Wulf 81] Wulf, W. A.; Levin, R.; Harbison, S. P.
HYDRA/C.mmp: An Experimental Computer System.
McGraw-Hill, 1981.




3.  ArchOS  System/Subsystem  Description 




1. Introduction

This document constitutes the System/Subsystem Description of ArchOS (CDRL Item No. A003) for
RADC Contract No. F30602-84-C-0063, Reconfigurable DDP System. This report describes the
Archons operating system (ArchOS) at the system/subsystem level.

ArchOS  is  the  operating  system  being  designed  and  constructed  as  part  of  the  Archons  research 
project [Jensen 84]. The goal of the project is to conduct research into the issues of decentralization
in  distributed  computing  systems  at  the  operating  system  level  and  below.  The  primary  research 
issues  being  investigated  within  Archons  are  decentralized  control,  team  and  consensus  decision 
making,  transaction  management,  probabilistic  algorithms,  and  architectural  support  in  a  real-time 
environment. 

It  is  planned  that  ArchOS  will  be  developed  in  three  stages.  The  first,  called  the  interim  testbed 
version, is now underway and will handle a small set of Sun Workstations¹ interconnected by an
Ethernet.  It  is  expected  that  this  operating  system  will  constitute  an  existence  proof  of  many  of  the 
basic  concepts  involved  in  our  research  as  applied  to  the  construction  of  a  decentralized  operating 
system.  This  system  will  later  be  followed  by  a  more  complete  testbed  operating  system  incorporating 
the  lessons  learned  in  the  interim  testbed  construction.  Finally,  an  analysis  of  the  ArchOS  structure  is 
expected  to  result  in  the  construction  of  specialized  hardware  for  the  purpose  of  executing  a 
complete  ArchOS  system,  and  ArchOS  will  be  rewritten  to  run  on  that  hardware  at  that  time. 

This  document  consists  of  two  reports,  namely  the  ArchOS  System  Architecture  Specification  and 
the  System  Design  Specification.  Chapter  2  contains  the  description  of  the  system  architecture  of 
ArchOS.  The  system  architecture  description  summarizes  the  structure  of  ArchOS  that  supports  the 
ArchOS  client  interface.  ArchOS  consists  of  the  ArchOS  kernel,  subsystems,  and  system  arobjects. 
The ArchOS kernel provides basic mechanisms for time-driven resource management. The
subsystems  together  with  system  arobjects  realize  ArchOS  facilities. 

Chapter  3  contains  the  system  design  description  of  ArchOS.  The  system  design  description 
describes  the  structure  and  functionality  of  major  components  of  ArchOS,  such  as  the  ArchOS  kernel, 
the  arobject/process  management  subsystem,  the  communication  subsystem,  the  transaction 
subsystem, the file subsystem, the I/O device subsystem, the time-driven scheduler subsystem, and
the  monitoring  and  debugging  subsystems. 


¹ Sun Workstation is a trademark of SUN Microsystems, Inc.




1.1  Overview  of  ArchOS  Development 

In  the  previous  RADC  contract  F30602-81-C-0297,  we  studied  basic  problems  of  decentralized 
resource  management  [Jensen  83a].  In  particular,  we  created  a  new  type  of  transaction,  called  a 
compound  transaction,  for  operating  system  functions  and  developed  its  theory.  We  also  studied 
policy/mechanism separation issues in Interprocess Communication and developed a decentralized
algorithm testing environment, called DATE, supported by the 4.1 BSD UNIX operating system².
Furthermore,  we  explored  issues  in  decentralized  computer  architecture  focusing  on  processor 
architecture  topics  such  as  the  separation  of  the  operating  system  (OS)  and  application  processors  in 
a  single  node  and  the  architecture  of  the  OS  processor.  Finally,  the  first  interim  testbed  facility  was 
built  based  on  a  set  of  Sun  workstations  interconnected  by  an  Ethernet. 

Based  on  the  results  from  these  research  efforts,  we  started  the  design  and  development  of  ArchOS 
at  the  end  of  1983.  During  1984,  we  reached  several  important  milestones  in  our  ArchOS 
development  plan  which  were  previously  reported  [Jensen  85].  First,  we  finalized  our  development 
plan  and  produced  research  requirements  for  ArchOS.  The  ArchOS  development  plan  was  described 
in our proposal [Jensen 83b], and the research requirements document was included in Chapter 2 of
"Functional  Description  of  ArchOS"  [Jensen  85].  Second,  we  developed  the  computational  model  of 
ArchOS  and  defined  a  client  interface.  The  ArchOS  computational  model  was  designed  to  provide  a 
high  level  of  modularity  and  maintainability  for  both  the  designer  of  real-time  applications  using 
ArchOS,  and  for  the  ArchOS  implementors  themselves.  The  ArchOS  client  interface  defines  the 
client’s  view  of  the  total  functionality  of  ArchOS.  All  of  the  ArchOS  primitives  are  also  described  at 
this  level  and  internal  system  mechanisms  are  also  being  developed.  Third,  along  with  the 
development  of  the  client  interface,  we  also  investigated  a  specification  technique  for  ArchOS.  In  this 
work,  such  questions  as  how  to  convert  the  precise  needs  of  the  client  into  the  system  resources 
which  will  satisfy  these  needs,  and  how  to  describe  a  system  which  can  provide  these  resources  were 
studied.  Fourth,  we  created  the  System  Functionality  Specification  for  ArchOS  (CDRL  Item  No. 
A002). This specification describes Archons’ guiding concepts and provides an overview of ArchOS
functionalities.  Finally,  the  basic  Archons  interim  testbed  facility  is  now  stable,  and  we  are  ready  to 
build  a  basic  kernel  to  support  further  experiments. 

During  1985,  we  have  started  to  design  the  ArchOS  system  architecture  based  on  the  previous 
specifications.  In  particular,  we  substantially  used  our  arobject  model  to  build  ArchOS  facilities. 


² UNIX is a trademark of AT&T Bell Laboratories.




1.2 Relationship to Previous Specification Documents

This  document  contains  two  reports  which  describe  the  system  architecture  specification  (SAS)  of 
ArchOS  and  the  system  design  specification  (SDS)  of  ArchOS. 

As  we  proposed  in  [Jensen  83b],  we  have  decided  to  adopt  a  top-down  approach  to  the  design  and 
to  produce  a  series  of  specification  documents  as  we  refine  our  design  of  the  ArchOS  operating 
system.  Figure  1-1  shows  the  planned  series  of  documents  and  their  relationships  to  one  another. 

The  first  three  documents,  namely  research  requirements  (RR),  system  functionality  specification 
(SFS) and client interface specification (CI), are complete and are included in the "Functional
Description  of  ArchOS",  [Jensen  85]  (CDRL  Item  No.  A002).  This  document  contains  the  next  two 
reports  in  the  design  sequence,  namely  the  system  architecture  specification  (SAS)  and  the  system 
design  specification  (SDS),  summarizing  the  internal  system  architecture  of  the  ArchOS  operating 
system. 


[Figure: diagram with "Archons Researchers & Sponsors" at the top and "ArchOS Implementation" at the bottom, connected by the specification documents listed in the legend below.]

RR: Research Requirements
CI: Client's Interface Specification
SFS: System Functionality Specification
SAS: System Architectural Specification
SDS: System Design Specification
CDS: Component Design Specification

Figure 1-1: ArchOS Development Steps




2.  ArchOS  System  Architecture  Description 

This  chapter  describes  the  system  architecture  of  ArchOS  which  supports  all  of  the  facilities  and 
primitives  described  in  the  client  interface  document  [Jensen  85].  Since  we  view  "ArchOS  system 
architecture"  as  a  high-level  view  of  the  internal  system  design,  this  chapter  focuses  on  the 
description of the major components of the ArchOS operating system and leaves the detailed design to
the  next  chapter. 

From  a  conceptual  point  of  view,  ArchOS  consists  of  the  ArchOS  kernel  and  system  arobjects  which 
are  grouped  together  to  form  subsystems.  The  ArchOS  kernel  provides  a  set  of  basic  mechanisms 
which  can  support  arobjects  in  both  kernel  and  client  address  spaces.  An  ArchOS  subsystem 
consists of one or more system arobjects which together provide an ArchOS facility. In other words, the
ArchOS  primitives  described  in  the  client  interface  document  are  defined  as  a  set  of  operations  of 
these  system  arobjects. 

2.1  Overview 

ArchOS is neither a time-sharing operating system nor a network operating system for a set of
workstations.  ArchOS  is  a  physically  dispersed  operating  system  which  performs  decentralized 
system-wide  resource  management  for  a  decentralized  computer  which  can  be  physically  dispersed 
across  10'^  to  10^  meters,  interconnected  without  the  use  of  shared  primary  memory. 

We view ArchOS as a research vehicle to perform research on the issues of decentralization in
real-time  distributed  systems  at  the  OS  level  and  below.  Thus,  the  ArchOS  facilities  were  designed  to 
allow  test  applications  to  be  constructed  with  which  to  study  ArchOS  characteristics,  but  they  need 
not  provide  a  particularly  complete  set  of  facilities  in  an  application  production  environment.  As  we 
specified  in  the  client  interface  document,  some  system  facilities  are  not  fleshed  out,  but  each  facility 
is  functionally  closed  so  that  each  function  can  be  evaluated  with  respect  to  our  research  issues. 

The basic model for system-wide resource management in ArchOS is based on our previous
conceptual/theoretical and experimental studies. In particular, we are interested in
decentralized resource management schemes where each global decision is made multilaterally by a
group of peers through negotiation, compromise, and consensus. Thus, ArchOS facilities are
provided by a group of cooperating servers, in which each service entity is built as an arobject on
each node. Since each arobject can have its own state and can activate computation independently
of other arobjects, cooperation among servers can be easily represented by a set of arobjects.
Furthermore, an arobject cannot fetch or modify the state of any other arobject without invoking the
corresponding operation at the target arobject; thus the modularity of arobjects is preserved,
as is the potential for parallelism between the caller and called arobjects.
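
The encapsulation rule described above (state is reachable only through operations, and the caller may run in parallel with the called arobject) can be illustrated with a small Python sketch. This is not ArchOS code: the `Arobject` class, its `request` method, and the `add`/`find` handlers are hypothetical names chosen for the illustration, and a thread with a request queue stands in for an arobject's own processes.

```python
import queue
import threading


class Arobject:
    """Private state plus a server thread; callers can only send requests."""

    def __init__(self, operations):
        self._state = {}                 # never handed out to callers
        self._ops = operations           # operation name -> handler function
        self._requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        # Requests are handled one at a time, in arrival order.
        while True:
            op, args, reply = self._requests.get()
            try:
                reply.put(("OK", self._ops[op](self._state, *args)))
            except Exception as exc:     # unknown operation or handler error
                reply.put(("FAIL", exc))

    def request(self, op, *args):
        """Asynchronous invocation: returns a reply queue immediately."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, args, reply))
        return reply


def add(state, name, value):
    state[name] = value
    return True


def find(state, name):
    return state.get(name)


directory = Arobject({"add": add, "find": find})
```

A caller issues `pending = directory.request("add", "printer", 7)`, continues with other work, and later collects the result with `pending.get(timeout=1)`; at no point does it touch the directory's state directly.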

From  the  higher-level  point  of  view,  the  overall  structure  of  ArchOS  is  divided  into  two  system 
components.  One  component  is  an  ArchOS  kernel  which  provides  a  set  of  basic  mechanisms  and 
links the higher-level policy with the mechanisms, and creates an arobject environment. The other is a
set  of  system  arobjects  which  implements  ArchOS  system  facilities.  A  system  arobject  can  be  further 
distinguished  by  where  it  resides,  namely  a  kernel  arobject  which  exists  in  the  kernel  and  non-kernel 
arobjects  which  exist  outside  the  kernel. 

Since an ArchOS facility is built based on cooperating arobjects, it is easy to change the internal
implementation  without  impacting  client  programs.  Furthermore,  certain  facilities  are  also  built  based 
on  the  notion  of  policy/mechanism  separation  [Wulf  81],  so  that  an  existing  policy  can  be  changed  or 
a  new  type  of  policy  can  be  created  without  changing  mechanisms.  Thus,  ArchOS  facilities  can  be 
easily  tuned  towards  a  particular  type  of  application  environment  without  changing  the  mechanisms 
in  the  system. 
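
As an illustration of policy/mechanism separation, the Python sketch below keeps one dispatching mechanism fixed and swaps the selection policy. It is only a sketch under assumptions: the `Dispatcher` class, the `Request` fields, and the two policies are hypothetical and are not taken from the ArchOS design.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int     # time the request was submitted
    deadline: int    # time by which it should complete


def fifo_policy(pending):
    # Policy 1: serve requests in arrival order.
    return min(pending, key=lambda r: r.arrival)


def earliest_deadline_policy(pending):
    # Policy 2: serve the request whose deadline is nearest.
    return min(pending, key=lambda r: r.deadline)


class Dispatcher:
    """Mechanism: stores pending requests and hands out the chosen one."""

    def __init__(self, policy):
        self.policy = policy
        self.pending = []

    def submit(self, request):
        self.pending.append(request)

    def next(self):
        if not self.pending:
            return None
        chosen = self.policy(self.pending)
        self.pending.remove(chosen)
        return chosen


def run(policy, requests):
    # Drain a dispatcher and report the order in which requests were served.
    dispatcher = Dispatcher(policy)
    for request in requests:
        dispatcher.submit(request)
    order = []
    while dispatcher.pending:
        order.append(dispatcher.next().name)
    return order
```

Changing from `fifo_policy` to `earliest_deadline_policy` alters the service order without any change to `Dispatcher`, which is the property the text attributes to facilities built on this separation.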

2.2  ArchOS  System  Architecture 

The  system  architecture  of  ArchOS  should  be  flexible  enough  so  that  various  degrees  of 
decentralized  resource  management  schemes  can  be  applied  in  ArchOS  without  incurring  major 
modification  cost.  Since  the  Archons  project  is  not  attempting  to  develop  a  computing  facility,  we 
must consider "openness" of the system architecture.

The  system  structure  of  ArchOS  was  designed  to  improve  the  robustness,  extensibility,  and 
modularity  of  OS  functions.  There  is  no  centralized  decision  maker  in  the  system  for  any  type  of 
system-wide  resource  management.  That  is,  any  system-wide  resource  management  is  performed  by 
the  cooperation  among  the  essentially  identical  peer  server  modules  at  each  node.  These  modules 
allow  the  use  of  the  "most  decentralized"  algorithms  (that  are  practical)  for  managing  system 
resources. 

The ArchOS system architecture was designed without assuming any specific hardware-level
structures such as node architecture and communication architecture. However, to maximize
performance, it is preferable to have the following architectural support: 1) Each node should have
two or more processors with a reasonable amount of shared memory; 2) Each node can communicate
with another node, a set of nodes (i.e., multicasting), or all of the nodes within a reasonable amount of
time; 3) Each node should have a logically homogeneous environment. That is, even if the processor
architectures are different, all of the nodes can emulate the ArchOS functionality.


2.2.1  Objectives 

The  major  objectives  are  derived  from  the  research  requirements  and  the  functionality  of  ArchOS. 
In  particular,  ArchOS  is  designed  to  be  a  research  vehicle  for  investigating  various  decentralized 
resource  management  techniques,  so  that  various  types  of  OS  facilities  can  be  added,  changed  or 
deleted. 


The  design  objectives  of  ArchOS  system  architecture  are  summarized  as  follows: 

•  Open  system  architecture: 

Since ArchOS is designed as a research vehicle to investigate various decentralization
issues at the operating system level and below, new system components should be easy to
add and existing facilities easy to delete.

•  Highly  decentralized  resource  management: 

ArchOS  should  not  have  any  centralized  decision  maker  in  the  system  and  should  be 
structured  as  essentially  identical  peer  modules  replicated  at  each  system  node. 

•  High  system  availability: 

To  maintain  high  system  availability  in  the  face  of  node  and  communications  faults, 
provide  for  no  lasting  degradation  of  system  function  or  response  beyond  the  actual 
resource  loss  encountered. 

•  High  system  modularity: 

Previous  distributed  systems  have  been  constructed  around  the  physical 
communications  limitations  of  the  system,  as  opposed  to  being  based  on  such  modern 
software  engineering  principles  as  abstract  data  types  and  information  hiding  for  defining 
modularity. 

•  Highly  extensible  facilities: 

To  construct  highly  extensible  facilities  which  will  limit  the  cost  of  redesign  for 
implementing  widely  divergent  operating  system  facilities  for  experimentation  with  the 
fundamental  concepts  of  interest  to  us  in  this  research. 

•  Reasonable  system  performance: 

To  provide  reasonable  system  performance  to  support  applications  with  real-time 
constraints.  ArchOS  must  consider  time-critical  functions  for  which  time  constraints 
define  some  system  failure  modes  (i.e.,  issues  of  timeliness  will  not  be  ignored  as  they  are 
in  some  systems,  nor  will  they  be  dealt  with  by  simply  providing  a  large  ratio  of  available 
to  currently  used  resources,  which  is  how  virtually  all  other  real-time  systems  strive  to 
meet  response  time  requirements). 


2.2.2  Basic  Approach 

A basic approach to meeting the previous objectives is to provide a uniform view of the system
components,  namely  a  kernel  and  a  set  of  arobjects.  An  ArchOS  kernel  provides  basic  mechanisms 
for  ArchOS  facilities  and  the  cooperating  arobjects  support  the  policies  to  provide  a  particular  set  of 
services.  While  we  avoid  a  large  monolithic  kernel,  we  try  to  improve  robustness,  modularity,  and 
extensibility  of  each  ArchOS  facility  as  well  as  increasing  potential  parallelism  by  using  the 
cooperating  arobjects. 


To  meet  the  previous  objectives,  we  will  use  the  following  approaches: 

•  Open  system  architecture: 

We will build ArchOS based on an object-oriented architecture view, so that the ArchOS
kernel is not a monolithic kernel and consists of a set of basic mechanisms and kernel
arobjects. On top of the kernel, ArchOS facilities are provided by cooperating arobjects.
Since  an  arobject  encapsulates  its  associated  information  (information  hiding),  key 
properties  of  each  facility  can  be  easily  modified  without  affecting  the  other  facilities. 
Furthermore,  additional  system  flexibility  will  be  provided  by  identifying  and  separating 
mechanism  and  policy  in  ArchOS  facilities. 

•  Highly  decentralized  resource  management: 

Since  we  try  to  avoid  having  any  type  of  centralized  decision  maker  for  system-wide 
resources  and  each  arobject  can  have  its  own  computational  state  and  knowledge,  it  is 
easy  to  form  cooperation  among  peer  modules  in  the  system.  Unlike  traditional 
procedure  invocation  in  abstract  data  types,  a  caller  cannot  invoke  an  operation  on  an 
arobject  in  a  master-slave  manner,  but  must  use  a  form  of  rendezvous.  Based  on  its  own 
state  and  decision,  the  receiver  arobject  has  the  right  to  accept,  reject,  or  delay  the 
invocation  of  the  requested  operation. 

•  High  system  availability: 

The  use  of  atomic  transactions  to  maintain  consistency  will  result  in  continuance  of 
useful  computation  in  the  presence  of  lost,  delayed,  inaccurate,  or  incomplete 
information.  Also,  a  service  handled  by  a  set  of  replicated  arobjects  can  be  easily 
supported  by  using  the  one-to-many  communication  capability  together  with 
transactions. 

•  High  system  modularity: 

Since  an  arobject  can  encapsulate  its  state  and  implementation  details  from  the  other 
arobjects,  it  is  easy  to  use  it  as  a  system  building  block.  System  modularity  is  also 
achieved  not  only  by  using  arobjects,  but  also  through  policy/mechanism  separation  in 
ArchOS  facilities. 

•  Highly extensible facilities:

ArchOS  facilities  are  implemented  as  a  set  of  system  arobjects,  thus  it  is  possible  to 
create  a  new  type  of  service  by  creating  a  new  set  of  arobjects  and  registering  them. 

•  Reasonable system performance:

To provide reasonable system performance, ArchOS subsystems are built based on a set
of arobjects which can be executed on a multiprocessor-based node without major
redesign.
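
The rendezvous-style invocation described under decentralized resource management can be pictured with a minimal Python sketch. The names `Decision`, `Arobject`, `invoke`, and `release`, and the capacity-based decision rule, are hypothetical illustrations and not ArchOS interfaces. The only point carried over from the text is that the receiver, not the caller, decides whether a requested operation is accepted, rejected, or delayed.

```python
from collections import deque
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    DELAY = "delay"


class Arobject:
    """Hypothetical receiver that decides, from its own state, how to treat a request."""

    def __init__(self, capacity):
        self.capacity = capacity      # units of resource this receiver still holds
        self.delayed = deque()        # requests postponed for later service

    def decide(self, amount):
        # Assumed policy: reject meaningless requests, delay those that
        # cannot be met right now, and accept the rest.
        if amount <= 0:
            return Decision.REJECT
        if amount > self.capacity:
            return Decision.DELAY
        return Decision.ACCEPT

    def invoke(self, amount):
        decision = self.decide(amount)
        if decision is Decision.ACCEPT:
            self.capacity -= amount
        elif decision is Decision.DELAY:
            self.delayed.append(amount)
        return decision

    def release(self, amount):
        """Return capacity, then serve delayed requests in arrival order."""
        self.capacity += amount
        served = []
        while self.delayed and self.delayed[0] <= self.capacity:
            request = self.delayed.popleft()
            self.capacity -= request
            served.append(request)
        return served
```

A caller in this sketch only learns the receiver's decision; it has no way to force the operation, which is the contrast with master-slave procedure invocation drawn above.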

A conceptual view of the ArchOS system architecture is depicted in Figure 2-1. On each node, built
from one or more processors with shared memory, there are an ArchOS kernel and system arobjects.


[Figure 2-1: Overview of ArchOS System Architecture. Layers shown, from top to bottom: Client Arobjects; ArchOS Subsystems; Kernel Arobjects; Base Kernel (K1, K2, ..., Km); Node Processors (N1, N2, ..., Nm).]


The ArchOS kernel consists of a set of basic mechanisms and a set of system arobjects which
reside  within  a  kernel  domain.  No  system  arobjects  can  be  accessed  without  invoking  the  ArchOS 
primitives.  A  client  process  must  use  a  communication  primitive,  request,  to  invoke  any  system 
primitives.  ArchOS  primitives  defined  in  the  client  interface  are  provided  either  as  kernel  primitives  or 
system  primitives.  Kernel  primitives  are  defined  as  an  operation  on  kernel  arobjects,  while  system 
primitives  are  provided  by  an  operation  on  the  system  arobject  which  exists  above  the  kernel. 

ArchOS Subsystems are one or more arobjects grouped together by sharing the same responsibility
for  providing  a  particular  service,  namely  a  set  of  functions  in  an  ArchOS  facility. 

A  system  arobject  can  be  a  kernel  arobject  or  an  ordinary  arobject  which  resides  in  the  user 
domain.  A  client  arobject  can  be  created  on  top  of  the  kernel  and  resides  in  the  user  domain. 


2.2.3  Functional Dependency among ArchOS Subsystems

The relationship among ArchOS subsystems can be summarized by abstracting out the detailed
interaction among arobjects across the various subsystems. Since each subsystem consists of a set
of  cooperating  arobjects,  many  activities  are  initiated  by  invoking  a  particular  operation  in  the 
arobject.  This  section  focuses  on  the  functional  dependency  among  ArchOS  subsystems,  rather  than 
on  the  detailed  interaction  among  arobjects.  The  description  of  the  internal  structure  of  each 
subsystem  will  be  given  in  the  next  chapter. 

From  a  functional  point  of  view,  ArchOS  subsystems  can  be  depicted  as  in  Figure  2-2. 


[Figure 2-2: Functional Dependency among ArchOS Subsystems. Boxes shown, from top to bottom: ArchOS Subsystems (Arobject/Process Subsys., Communication Subsys., File Subsys., Monitoring/Debugging Subsys.); ArchOS Kernel, comprising Kernel Arobjects (Transaction Subsys., Time-driven VM Subsys., Time-driven Scheduler Subsys., Page Set Subsys., Policy Subsys., I/O Subsys.) and the Base Kernel; Node Processors (Hardware).]


In  Figure  2-2,  each  box  represents  a  subsystem  in  ArchOS.  However,  unlike  a  traditional 
hierarchical  layered  structured  system,  an  operation  at  a  lower-level  subsystem  can  be  directly 
invoked  from  a  client  arobject. 


The  ArchOS  Kernel  provides  a  set  of  basic  mechanisms  and  consists  of  a  set  of  system  arobjects 
which  reside  within  a  kernel  domain.  The  portion  of  the  basic  mechanisms  is  called  the  Base  kernel. 
Unlike a traditional monolithic kernel, the ArchOS kernel contains a set of system arobjects, which
we  refer  to  as  kernel  arobjects,  and  it  can  initiate  independent  computation  from  client  arobjects.  A 
certain  set  of  basic  mechanisms  are  realized  to  coordinate  with  the  policy  definition  module,  so  that 
facilities  may  provide  a  different  type  of  functionality  to  clients. 

The  Policy  management  subsystem  maintains  a  user-definable  system  module,  called  the  policy 
definition  module  which  consists  of  a  policy  body  and  a  set  of  policy  attributes.  Since  a  policy 
definition  module  can  be  placed  in  the  kernel,  a  system  arobject,  or  a  client  arobject,  the  policy 
management  subsystem  creates  and  destroys  a  policy  definition  descriptor  which  indicates  the 
location of the policy definition and the information related to the policy body and attribute set.

The  Page  set  subsystem  provides  a  uniform  view  of  secondary  storage,  namely  a  page  set  in 
ArchOS.  An  arobject  can  access  a  file  through  the  page  set  and  logical  disk  subsystems.  On  each 
node,  there  is  a  page  set  manager  which  can  allocate  or  deallocate  a  permanent  type  or  atomic  type 
of  page  set  on  a  specified  logical  disk. 

The  I/O  device  subsystem  manages  interaction  between  I/O  devices  and  arobjects.  Any  signal 
from  I/O  devices  is  handled  at  this  subsystem  and  is  translated  into  an  invocation  request  for  a 
system  arobject. 

The  System  monitoring  and  debugging  subsystem  provides  various  abilities  to  monitor  and  control 
the  behavior  of  cooperating  arobjects  and  their  processes  during  execution. 

The  Time-driven  scheduler  subsystem  provides  a  time-driven  scheduling  facility  which  may  be 
altered by adding, changing, or selecting a different type of scheduling policy. The basic mechanism
is  based  on  ArchOS’  "best  effort"  scheduling  model  [Locke  85].  This  subsystem  also  provides 
functions  such  as  obtaining  time  information  and  setting/resetting  scheduling  parameters  for  clients. 

The  Time-driven  virtual  memory  subsystem  provides  a  virtual  memory  management  facility  in  a 
time-driven fashion. The time-driven scheduler may be called from this subsystem to initiate
pre-paging in/out activities in such a way that the system can execute time-critical tasks in a timely
manner. 

The Transaction subsystem provides transaction mechanisms for operations on arbitrary types of
arobjects. This subsystem supports two types of transactions, namely elementary and compound
transactions, which can be nested in arbitrary combinations. To coordinate an atomic update for a
transaction, the transaction subsystem must cooperate with one or more page set managers which
the transaction has visited.

The Arobject/process management subsystem provides the basic functions to create and destroy
arobjects and processes. This subsystem also manages binding and unbinding of
arobject/process reference names and supports binding protocols to match a requestor with one
or  more  suitable  server  arobjects.  The  requestor  arobject  invokes  an  operation  by  using  the 
invocation  protocol  through  the  communication  subsystem. 

The  Communication  subsystem  provides  the  invocation  protocol  and  manages  arobject  invocation 
among  arobjects.  This  subsystem  provides  not  only  one-to-one  communication  for  a  conventional 
client-(single)  server  model,  but  also  one-to-many  communication  for  the  client-cooperating  multiple 
server  model. 

The  File  subsystem  provides  system-wide  location-independent  file  access.  Although  the  file 
subsystem  does  not  support  a  hierarchical  file  name  space,  a  system-wide  flat  name  space  is 
maintained at each node. The file subsystem also provides three kinds of file properties. A normal file
has  the  data  portion  of  the  file  arobject  in  volatile  storage.  A  permanent  file’s  data  portion  is  allocated 
in  permanent  (non-volatile)  storage  and  an  atomic  file  has  its  data  portion  declared  as  an  atomic  data 
object.  For  atomic  files,  the  transaction  facility  is  also  supported  through  the  transaction  subsystem. 

2.3  Structure  of  ArchOS  Subsystems 

An  ArchOS  subsystem  consists  of  a  set  of  system  arobjects,  called  components,  resident  on  each 
node  and  provides  a  system-wide  or  local  service  within  a  node.  Each  subsystem  was  designed 
based  on  a  generic  server  structure  for  ArchOS  in  order  to  provide  reliable  and  fast  service  of  the 
requested  system  functions. 

2.3.1  Objectives 

The  internal  structure  of  a  subsystem  should  meet  the  following  objectives: 

•  Various  types  of  decentralized  resource  management  schemes  should  be  easily 
implemented  in  a  subsystem. 

•  A  subsystem  should  be  able  to  serve  concurrent  requests  efficiently,  fairly  and  without 
deadlocks. 

•  A  subsystem  should  be  able  to  support  replicated  arobjects  to  increase  availability  of  its 
service. 


•  Each  component  should  be  able  to  stop  independently  and  resume  its  activity  without 
major  disturbance. 

•  Within  a  component,  concurrent  operation  invocation  should  be  easily  provided  if  the 
service  requests  can  be  handled  simultaneously. 

2.3.2  Internal  Structure  of  Subsystems 

The internal structure of subsystems is based on a generic server model which can provide
system-wide service in a reliable manner. The server model was used to provide the template for different
types  of  servers  to  reduce  the  design  complexity  of  subsystems. 

The  server  model  consists  of  service  protocols,  which  define  a  complete  interaction  between  a 
client  and  all  servers  providing  the  same  service,  and  a  server  structure. 

Our view of system service at the subsystem level is depicted in Figure 2-3.


[Figure 2-3: Relationship among Service, Server, and Arobject. Legend entries: Operations; Processes & Private Objects.]


At the subsystem level, a reference name is given to each service class which provides a particular
service, namely a set of functions for an ArchOS facility. A service class is implemented by a set of
service instances, namely servers. A server is formed by an instance of an arobject and is normally
replicated at each node. However, a different set of servers can also form a service class, since a
reference name can be shared by these servers.

In  general,  services  are  replicated  on  each  node,  and  system-wide  resource  management  is 
performed  by  using  the  service  protocol  to  define  interaction  between  client  and  server  as  well  as 
inter-server  communication. 

2.3.2.1  Service Protocol

A  subsystem  provides  a  service  or  a  set  of  services  for  a  client.  A  client  can  locate  a  suitable  server 
for  a  requested  service  and  get  the  result  back  by  initiating  the  service  protocol.  The  service  protocol 
consists  of  three  protocols:  binding,  invoking,  and  inter-server  protocols. 

The binding protocol defines how a client determines a suitable server for a requested service from
a  set  of  servers.  For  instance,  we  can  apply  the  following  binding  policies  for  certain  types  of 
resource  management  such  as  creating  a  new  instance  of  an  arobject  or  a  process  in  the  system: 

•  First  Fit  (FF): 

The  first  server  from  the  matched  servers  will  be  selected  to  submit  a  service  request. 

•  Random  Fit  (RF): 

A  server  from  the  random  selection  will  be  used  to  submit  a  service  request. 

•  Best Fit (BF):

One  of  the  best  matched  servers  will  be  used  to  submit  a  service  request. 

•  Best  Effort  Fit  (BEF): 

The best matched server according to the best-effort (decision making) algorithm will be
used to submit a service request.

It  should  be  noted  that  RF  protocol  can  be  used  without  any  interaction  with  potential  servers,  while 
FF  protocol  may  need  at  least  one  reply  from  the  servers,  BF  needs  all  replies,  and  BEF  needs  all  or 
some  replies. 
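
The four binding policies differ mainly in how many server replies they need before choosing. The following Python sketch illustrates that difference under stated assumptions: `Server`, its numeric `load` field (lower meaning a better match), and the four function names are hypothetical. In particular, the best-effort decision algorithm is not specified in this section, so `best_effort_fit` is only a placeholder that decides on whatever replies have arrived.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    load: float   # assumed matching criterion: lower load is a better match


def first_fit(replies):
    """FF: take the first server that replies (needs at least one reply)."""
    return replies[0]


def random_fit(candidates, rng=random):
    """RF: pick any known candidate; no interaction with servers is needed."""
    return rng.choice(candidates)


def best_fit(all_replies):
    """BF: compare every reply and take the best match (needs all replies)."""
    return min(all_replies, key=lambda s: s.load)


def best_effort_fit(partial_replies):
    """BEF placeholder: decide on the replies received so far (all or some).

    The real best-effort algorithm is not given here; picking the lowest
    load among the available replies is only a stand-in.
    """
    return min(partial_replies, key=lambda s: s.load)
```

For example, with servers `n1` (load 0.7), `n2` (load 0.2), and `n3` (load 0.5) replying in that order, `first_fit` selects `n1` and `best_fit` selects `n2`; if the reply from `n2` has not yet arrived, `best_effort_fit` settles for `n3`.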

The  invoking  protocol  defines  how  a  client  can  invoke  an  operation  at  a  specific  or  non-specific 
server(s)  and  can  get  the  service  result.  In  ArchOS,  a  client  arobject  can  invoke  an  operation 
synchronously  as  well  as  asynchronously.  Thus,  it  is  possible  for  a  client  to  invoke  an  operation  at 
multiple  servers  simultaneously. 

•  synchronous/specific server request:

A client can invoke an operation on a specific arobject by using a Request primitive.

•  synchronous/non-specific servers request:

A client can invoke an operation on a specific service class by using a RequestAll
primitive followed by the necessary GetReply primitives.

•  asynchronous/specific server request:

A client can invoke an operation on a specific arobject without blocking itself by using a
RequestSingle primitive.

•  asynchronous/non-specific servers request:

A client can invoke an operation on a specific service class by using a RequestAll
primitive.
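
The four invocation modes listed above can be approximated in Python with futures. This is a sketch under assumptions, not the ArchOS implementation: servers are modeled as plain callables, a thread pool stands in for the communication subsystem, and the future returned by the asynchronous calls plays the role that a transaction id plays when a reply is collected. The function names are snake-case stand-ins loosely modeled on the request and reply primitives named above.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)   # stand-in for the message transport


def request(server, opr, msg):
    """Synchronous, specific server: block until the reply arrives."""
    return server(opr, msg)


def request_single(server, opr, msg):
    """Asynchronous, specific server: return a handle without blocking."""
    return _pool.submit(server, opr, msg)


def request_all(service_class, opr, msg):
    """Non-specific servers: send the request to every server in the class."""
    return [_pool.submit(server, opr, msg) for server in service_class]


def get_reply(handle):
    """Collect one outstanding reply, blocking until it is available."""
    return handle.result()


def make_server(node):
    # Hypothetical server: echoes the operation together with its node name.
    def serve(opr, msg):
        return f"{node}:{opr}({msg})"
    return serve
```

In this sketch the synchronous non-specific case is `request_all` followed immediately by one `get_reply` per handle, while the asynchronous case defers those `get_reply` calls until the client needs the results.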

The  Inter-server  protocol  defines  how  an  individual  server  can  exchange  control  or  status 
information  to  perform  a  service.  Thus,  the  inter-server  protocol  is  dependent  upon  the  nature  of  the 
service.  For  instance,  local  resource  information  of  a  server  can  be  distributed  to  all  other  servers 
periodically  by  invoking  a  suitable  operation  for  a  service  class.  This  type  of  multicasting  can  also  be 
performed by using a RequestAll primitive.

2.3.2.2  Generic Structure for a Server

The  generic  structure  for  a  server  is  based  on  the  nature  of  service  the  server  provides  and  the 
characteristics  of  an  arobject.  Since  any  system  service  should  be  highly  available,  a  server  itself 
must increase its availability by avoiding unnecessary blocking periods and deadlocks.


Each arobject has at least one process which is responsible for creating the necessary task force
to provide a service. Processes may be created dynamically depending on the request
load. On the other hand, a certain server might have a fixed number of processes to handle a fixed
number  of  tasks.  Thus,  the  server  structure  is  related  to  the  nature  of  the  service  which  is  provided  by 
the  cooperating  arobjects. 


We can summarize the generic structure for a server as in Figure 2-4.

•  Type  I:  a  single  process 

The initial process accepts requests and processes them one at a time. That is, the
execution  order  of  the  operations  will  be  totally  serialized.  Each  operation  invocation  will 
be  handled  by  a  corresponding  procedure  or  function  within  the  initial  process. 

•  Type  II:  functionally  partitioned  workers 

The  initial  process  creates  a  number  of  workers  according  to  the  number  of  operations. 
Coordination  between  workers  is  handled  by  the  workers.  Each  operation  invocation  is 
accepted  by  a  specific  worker  process  assigned  to  do  the  task.  Thus,  a  worker  issues  the 
Accept  primitive  to  wait  for  a  specific  type  of  requested  operation. 


•  Type III: replicated workers

The initial process creates a number of replicated workers to accept "any" operation.
Coordination between workers is handled by the workers. Workers can be placed at
different nodes, so parallel execution of requested operations might be possible. Each
worker issues an AcceptAny primitive to wait for an invocation request.

•  Type IVa: a manager with a fixed number of workers

The initial process creates a fixed number of workers as a service task force and
becomes a manager of them. Coordination between workers is handled by the initial
process and each worker may or may not be assigned the same task.

•  Type IVb: a manager with a variable number of workers

The initial process acts as a manager of workers and accepts all incoming invocation
requests and creates necessary workers on demand. Coordination between workers is
handled by the initial process and each worker may or may not be given the same
functionality.

[Figure 2-4: Generic Server Structures for a Server. Panels: Type I, Single Process; Type II, Functionally Partitioned Workers; Type III, Replicated Workers; Type IVa, A Manager with a fixed number of workers; Type IVb, A Manager with a variable number of workers.]
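
The difference between functionally partitioned workers (Type II) and replicated workers (Type III) can be sketched with Python threads and queues. Everything here is an illustrative assumption: threads stand in for processes, a `queue.Queue` stands in for the request queue that a worker waits on, and the names `worker`, `run_type_ii`, `run_type_iii`, and `STOP` are hypothetical. In Type II each worker waits only for its own operation; in Type III every worker takes whatever request comes next.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def worker(inbox, handlers, results):
    """Take requests from `inbox` and run the matching handler."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        opr, arg = item
        results.put((opr, handlers[opr](arg)))


def drain(results):
    """Collect all results once every worker has finished."""
    items = []
    while not results.empty():
        items.append(results.get())
    return sorted(items)


def run_type_ii(requests, handlers):
    """Type II: one worker per operation, each with its own request queue."""
    results = queue.Queue()
    inboxes = {opr: queue.Queue() for opr in handlers}
    threads = [
        threading.Thread(target=worker, args=(inboxes[opr], {opr: fn}, results))
        for opr, fn in handlers.items()
    ]
    for t in threads:
        t.start()
    for opr, arg in requests:
        inboxes[opr].put((opr, arg))      # routed to the worker owning `opr`
    for inbox in inboxes.values():
        inbox.put(STOP)
    for t in threads:
        t.join()
    return drain(results)


def run_type_iii(requests, handlers, replicas=3):
    """Type III: identical workers share one queue and accept any operation."""
    results = queue.Queue()
    inbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, handlers, results))
        for _ in range(replicas)
    ]
    for t in threads:
        t.start()
    for item in requests:
        inbox.put(item)
    for _ in threads:
        inbox.put(STOP)                   # one sentinel per replica
    for t in threads:
        t.join()
    return drain(results)
```

Both structures produce the same set of results for the same requests; they differ in which worker handles a request and therefore in how much parallelism a burst of identical operations can exploit.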


3.  ArchOS  System  Design  Description 

This  chapter  contains  the  ArchOS  system  design  description  and  describes  the  internal  architecture 
of  each  ArchOS  subsystem.  From  a  conceptual  point  of  view,  the  ArchOS  operating  system  consists 
of  the  ArchOS  kernel  and  a  set  of  subsystems  composed  of  a  set  of  cooperating  system  arobjects. 
While  the  ArchOS  system  architecture  description  in  the  previous  chapter  discussed  higher-level 
views  of  the  system  structure,  this  chapter  focuses  on  how  each  subsystem  is  built  based  on 
cooperating  arobjects. 

3.1  Overview 

The  ArchOS  system  design  description  discusses  the  internal  architecture  of  ArchOS  focusing  on 
the  ArchOS  kernel  and  subsystems.  The  ArchOS  kernel  provides  basic  mechanisms  for  ArchOS 
facilities  and  each  subsystem  implements  each  service  facility  of  ArchOS. 

As discussed in the previous chapter, the ArchOS kernel consists of two layers: the lower layer is
the Base kernel and the higher layer is a set of kernel arobjects. On top of the base kernel, a kernel arobject
can  be  created  to  increase  the  concurrency  within  the  base  kernel.  ArchOS  provides  a  system-wide 
arobject  environment  on  top  of  the  ArchOS  kernel.  Since  ArchOS  facilities  are  implemented  by 
cooperating system arobjects, they can be modified, added, or deleted without incurring a major cost.

A  subsystem  consists  of  a  set  of  cooperating  arobjects  which  can  provide  the  actual  services  to 
clients.  In  general,  each  component  of  the  subsystem  is  replicated  on  each  node,  so  a  system-wide 
service  is  provided  as  the  result  of  interaction  among  components.  Each  component  also  has  the 
responsibility to recover from so-called "clean and soft failure" [Bernstein 83] at a node.³ The chosen
recovery  sequence  maintains  the  consistent  state  of  the  server. 

In  the  following  sections,  we  describe  the  ArchOS  kernel  and  each  ArchOS  subsystem  focusing  on 
its  internal  structure,  key  algorithms,  and  inter-server  protocols. 


³ We limit our discussion of resiliency to "clean and soft" failures, in which some of the nodes of the system simply stop
running and lose the contents of main memory, but the contents of stable storage used by the failed nodes (computers)
remain intact.


3.2  ArchOS  Kernel 

3.2.1  Overview  of  ArchOS  Kernel 

The ArchOS kernel is a collection of low-level mechanisms and kernel arobjects which provides
basic service facilities to support the ArchOS operating system. In order to increase its flexibility, the
ArchOS base kernel only provides time-critical system functions such as the time-driven scheduler, kernel
memory manager, and kernel arobject/process manager. Other low-level OS functions, such as the
transaction, arobject/process, and communication subsystems, can be created on
top of the base kernel. Furthermore, certain types of system functions, such as the monitoring/debugging
subsystem, are implemented by system arobjects which can be created on top of the ArchOS kernel.
As a result, ArchOS primitives defined in the client interface are implemented either as a kernel
primitive, an operation on kernel arobjects, or an operation on system arobjects.

Internally, the ArchOS kernel consists of a base kernel and a set of kernel arobjects which can
create a kernel arobject environment. Even though the base kernel will be built as a collection of
independent low-level modules, an object-oriented view must be maintained. It is a part of our study
to explore the minimum structure of the base kernel as well as an efficient structure for the base kernel.

Figure 3-1 shows the logical structure of the ArchOS kernel.

3.2.2  ArchOS  Base  Kernel 

The ArchOS base kernel provides basic mechanisms for ArchOS facilities, thus creating a uniform
arobject environment for system arobjects. Since the ArchOS kernel is not a traditional monolithic
kernel, a set of kernel arobjects run on top of the base kernel and support the necessary
low-level functions for ArchOS facilities.

The  major  functionality  of  the  base  kernel  can  be  summarized  as  follows: 

•  Creation  and  destruction  of  kernel  arobjects  and  processes 

•  Communication  between  kernel  arobjects 

•  Dispatching  mechanism  for  the  time-driven  scheduler 

•  Address  space  management  for  time-driven  virtual  memory  manager 

•  Low-level  synchronization  between  processes  and  interrupt/trap 


In other words, the base kernel is responsible for local resource management: creating
and destroying kernel arobjects and invoking their operations. Any remote or system-wide resource
management decisions are made through coordination among kernel arobjects.

[Figure 3-1: Logical Structure of ArchOS Kernel. System Arobjects: Atomic/Permanent/Normal File Arobject; Monitoring/Debugging Subsys. ArchOS Kernel, Kernel Arobjects: Monitoring/Debugging Management; File Subsys.; Communication Subsys.; Arobject/Process Subsys.; Transaction Subsys.; Time-driven VM Subsys.; Page Set Subsys.; Policy Subsys.; I/O Subsys. ArchOS Kernel, Base Kernel: Kernel Arobject/Process Management; Arobject Operation Invocation; Policy Definition Module; Time-driven Scheduler; Dispatcher; Kernel Memory Manager; Interrupt/Trap.]

3.2.2.1  Kernel Arobjects and Processes

The  ArchOS  base  kernel  provides  creation  and  destruction  of  a  kernel  arobject  in  its  node.  A  kernel 
arobject  is  similar  to  a  normal  arobject  except  that  it  resides  in  a  kernel  address  space  and  shares  the 
address  space  among  the  other  kernel  arobjects.  Similarly,  all  of  the  kernel  processes  share  their 
address  space  even  though  each  process  logically  belongs  to  a  specific  kernel  arobject. 

Binding of a kernel arobject/process and a reference name is also managed in the same way.
Unlike  a  client’s  reference  name,  a  kernel  arobject’s  reference  name  is  treated  as  a  "well-known" 
name  representing  a  service  class. 

The following normal ArchOS primitives are used to manage kernel arobjects and processes.


arobject-id = CreateArobject(arobj-name [, init-msg])
process-id = CreateProcess(process-name [, init-msg])
val = KillArobject(aid)
val = KillProcess(pid)

aid = SelfAid()
paid = ParentAid(aid)
pid = SelfPid()
ppid = ParentPid(pid)

val = BindArobjectName(aid, arobj-refname)
val = BindProcessName(pid, process-refname)
val = UnbindArobjectName(aid, arobj-refname)
val = UnbindProcessName(pid, process-refname)

aid = FindAid(arobj-refname)
pid = FindPid(process-refname)
aid-list = FindAllAid(arobj-refname)
pid-list = FindAllPid(process-refname)


AID arobject-id
        The unique identification of the instantiated arobject.

PID process-id
        The unique identification of the instantiated process.

AROBJ-NAME arobj-name
        The name of the arobject to be instantiated.

PROCESS-NAME process-name
        The name of the process to be instantiated.

MESSAGE *init-msg
        A pointer to the initial message which contains initial parameters for the INITIAL
        process.

AROBJ-REFNAME arobj-refname
        The requested reference name for an arobject given by aid.

PROCESS-REFNAME process-refname
        The requested reference name for a process given by pid.

AID-LIST aid-list
        The list of corresponding aids.

PID-LIST pid-list
        The list of corresponding pids.

AROBJ-REFNAME arobj-refname
        The reference name of related arobject(s).

PROCESS-REFNAME process-refname
        The reference name of related process(es).


It  should  be  noted  that  the  above  primitives  are  identical  to  the  ordinary  ArchOS  primitives  clients 
can  use,  but  only  local  creation  and  destruction  of  kernel  arobjects  and  processes  are  supported  in 
the  ArchOS  Kernel.  When  an  instance  of  a  kernel  arobject  or  process  is  created,  an  ordinary  arobject 
or  process  descriptor  will  be  created  and  managed  uniformly  at  the  kernel.  (The  description  of  the 
arobject/process  descriptor  is  given  in  Section  AROBJECT). 
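
The naming primitives above suggest a table that maps a reference name to one or more arobject ids, since servers in a service class may share a name. A minimal Python sketch of such a table follows. The class `NameRegistry`, its method names, the boolean return values, and the rule that a single-result lookup returns the earliest binding are all assumptions made for illustration; the source does not define these behaviors.

```python
import itertools
from collections import defaultdict


class NameRegistry:
    """Hypothetical per-node table mapping reference names to arobject ids."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_name = defaultdict(list)   # refname -> aids, in binding order

    def create_arobject(self):
        """Hand out a fresh arobject id (stands in for arobject creation)."""
        return next(self._next_id)

    def bind_arobject_name(self, aid, refname):
        if aid in self._by_name[refname]:
            return False                    # already bound under this name
        self._by_name[refname].append(aid)
        return True

    def unbind_arobject_name(self, aid, refname):
        if aid not in self._by_name.get(refname, []):
            return False
        self._by_name[refname].remove(aid)
        return True

    def find_aid(self, refname):
        """Return one bound aid (assumed: the earliest), or None."""
        aids = self._by_name.get(refname, [])
        return aids[0] if aids else None

    def find_all_aid(self, refname):
        """Return every aid sharing the reference name."""
        return list(self._by_name.get(refname, []))
```

Binding two arobjects to the same name and then calling `find_all_aid` yields both ids, which is how a client in this sketch would discover all servers of a service class.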

3.2.2.2  Kernel Communication Management

The  ArchOS  base  kernel  provides  a  local  communication  mechanism  among  kernel  arobjects.  A 
kernel  arobject  can  invoke  an  operation  at  a  single  destination  arobject  or  at  multiple  arobjects 
providing  the  same  service  name  (i.e.,  sharing  the  same  reference  name). 


trans-id  =  Request(arobj-id,  opr,  msg,  reply-msg) 
trans-id  =  RequestSingle(arobj-id,  opr,  msg) 
trans-id  =  RequestAII(arobject-name,  opr,  msg) 

pid  =  GetReply(trans-id,  reply-msg) 

(trans-id,  requestor,  opr)  =  AcceptAny(opr,  msg) 

(trans-id,  opr)  =  Accept(requestor,  opr,  msg) 

ptr-mds  3  CheckMessageQ(qtype,  requestor,  opr,  req-trans-id) 

trans-id  =  Reply(pid,  req-trans-id,  reply-msg) 
transaction-id  trans-id 

The  transaction  id  of  the  transaction  on  whose  behalf  the  request  is  being  made. 
AID  arobj-id  The  unique  id  of  the  receiving  arobject. 

OPE-SELECTOR  opr 

The  name  of  the  operation  to  be  performed. 

MESSAGE  *msg  A  pointer  to  the  message  which  contains  the  parameters  of  the  operation  to  be 
performed.  The  message  to  the  destination  arobject  must  not  contain  any 
pointers  (i.e.,  call-by-value  semantics  must  be  used). 

REPLY-MSG  *reply-msg 

A  pointer  to  the  reply  message. 

MSG-DESCRIPTORS *ptr-mds

Pointer  to  a  list  of  the  message  descriptors  selected  by  the  specified  selection 
criteria. 




MSG-Q qtype  This indicates either the "request" or the "reply" message queue.

AID requestor  The aid of the requesting arobject.

OPE-SELECTOR opr

The operation to be performed. The "opr" parameter can be a specific operation
name or "ANYOPR".

TRANSACTION-ID req-trans-id

The transaction id of the corresponding RequestSingle or RequestAll primitive.
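
The request/accept/reply pattern behind these primitives can be illustrated with a small simulation. This is a minimal sketch in Python, not ArchOS code: the class Kernel, the method names, and the queue representation are assumptions made for the example, and blocking behavior is not modeled. Only the primitive names (RequestSingle, AcceptAny, Reply, GetReply) and the ANYOPR selector come from the text.

```python
from collections import deque
from itertools import count

ANYOPR = "ANYOPR"  # wildcard operation selector named in the text


class Kernel:
    """Toy model of local kernel communication (hypothetical structure)."""

    def __init__(self):
        self._next_id = count(1)  # source of transaction ids
        self.request_q = {}       # arobj-id -> deque of pending requests
        self.reply_q = {}         # req-trans-id -> (replier, reply message)

    def request_single(self, requestor, arobj_id, opr, msg):
        """Queue a request at one destination arobject; return its trans-id."""
        trans_id = next(self._next_id)
        # Call-by-value: copy the message so that no references are shared.
        self.request_q.setdefault(arobj_id, deque()).append(
            (trans_id, requestor, opr, dict(msg)))
        return trans_id

    def accept_any(self, arobj_id, opr=ANYOPR):
        """Dequeue the first queued request matching opr, from any requestor."""
        queue = self.request_q.get(arobj_id, deque())
        for entry in queue:
            if opr == ANYOPR or entry[2] == opr:
                queue.remove(entry)
                return entry
        return None  # nothing matching is queued

    def reply(self, replier, req_trans_id, reply_msg):
        """Post the reply for an earlier request."""
        self.reply_q[req_trans_id] = (replier, dict(reply_msg))

    def get_reply(self, trans_id):
        """Return (replier, reply message), or None if no reply exists yet."""
        return self.reply_q.pop(trans_id, None)


kernel = Kernel()
tid = kernel.request_single("client", "file-server", "read", {"block": 7})
trans_id, requestor, opr, msg = kernel.accept_any("file-server")
kernel.reply("file-server", trans_id, {"data": "abc"})
result = kernel.get_reply(tid)
```

The message is copied on both the request and the reply path, which mirrors the requirement that messages contain no pointers.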


3.2.2.3 Policy Management

Policy management provides system functions to add, delete, and modify the policy definition
modules in ArchOS. Since the placement of a policy definition module is a major issue in terms of
system performance, ArchOS allows a client to specify the location by using a policy definition
descriptor. A policy definition module consists of a policy body and a set of policy attributes. Both
the policy body and the attributes can be modified at runtime.


val = SetPolicy(policy-name, policy-def-desc)

val = SetAttribute(policy-name, attribute-name, attribute-value)

pdd = AllocatePDD()
val = FreePDD(pdd)

BOOLEAN val  TRUE if the specified policy was set properly; otherwise FALSE.

PDD policy-def-desc

A pointer to the policy definition descriptor.

ATTRIBUTE-NAME  attribute-name 

The name of the attribute to be set.

ATTRIBUTE-VALUE attribute-value

The  actual  value  for  the  attribute. 


The SetPolicy primitive links a user-defined policy definition module to the ArchOS base kernel. The
SetAttribute primitive sets a specific value (or values) for one of the policy's attributes. Since a policy
definition descriptor is maintained in the base kernel, the actual policy body or the value of any of its
attributes can be accessed easily.




The  AllocatePDD  primitive  allocates  a  policy  definition  descriptor  in  the  base  kernel  and  FreePDD 
releases  the  allocated  descriptor. 
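
The life cycle of a policy definition descriptor can be sketched as follows. This is an illustrative Python model under stated assumptions: the class BaseKernel, the dictionary layout of a descriptor, and the policy and attribute names are all hypothetical; only the four primitive names and their BOOLEAN results are taken from the text.

```python
class BaseKernel:
    """Toy model of policy management (hypothetical structure)."""

    def __init__(self):
        self.pdds = {}      # pdd handle -> descriptor contents
        self.policies = {}  # policy-name -> pdd handle
        self._next_pdd = 0

    def allocate_pdd(self):
        """Allocate an empty policy definition descriptor in the kernel."""
        self._next_pdd += 1
        self.pdds[self._next_pdd] = {"body": None, "attributes": {}}
        return self._next_pdd

    def free_pdd(self, pdd):
        """Release a descriptor; False if it is not currently allocated."""
        if pdd not in self.pdds:
            return False
        del self.pdds[pdd]
        # Unlink any policy name that still refers to the descriptor.
        self.policies = {n: p for n, p in self.policies.items() if p != pdd}
        return True

    def set_policy(self, policy_name, pdd):
        """Link a policy definition module to the kernel via its descriptor."""
        if pdd not in self.pdds:
            return False
        self.policies[policy_name] = pdd
        return True

    def set_attribute(self, policy_name, attribute_name, attribute_value):
        """Set one attribute of a linked policy; False if it is unknown."""
        pdd = self.policies.get(policy_name)
        if pdd is None:
            return False
        self.pdds[pdd]["attributes"][attribute_name] = attribute_value
        return True


base = BaseKernel()
pdd = base.allocate_pdd()
base.pdds[pdd]["body"] = min                   # hypothetical policy body
linked = base.set_policy("scheduling", pdd)
attr_ok = base.set_attribute("scheduling", "quantum-ms", 10)
unknown = base.set_attribute("paging", "window", 4)  # policy never linked
```

Because the descriptor lives in the kernel-side table, both the body and the attributes can be reached from the policy name alone, which is the access path the text describes.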


[Figure 3-2: Policy Definition Module and PDD. Labels recoverable from the figure: ArchOS Kernel, Base Kernel, Policy def. module A.]


3.2.2.4 Time-driven Scheduling Management

The interface between the scheduler and the remainder of the ArchOS kernel is a simple one: the
scheduler acts as an object providing operations and maintaining its own scheduling
data base, including its scheduling queue. These operations, of course, will not be available to the
ArchOS client, but will be used internally by ArchOS to generate scheduling requests whenever
needed. The operations defined by the scheduler are as follows.

pid-list = Schedule()

pid-list = RequestSwapList()

SetScheduleInfo(pid)

Deschedule(pid)

PID-LIST pid-list  The list of pids.

PID process-id  The unique identification of the instantiated process.


The Schedule() primitive returns a list of the process ids (pids). The first pid indicates a process to
be executed at this time and the following pids are candidates for the following scheduling point. No
parameters are passed, and the scheduler makes its decision directly from the information within its
data base.

The RequestSwapList() primitive returns a list of the pids which indicates candidates for the
following swapping decisions at the virtual memory manager level.

The SetScheduleInfo primitive enters the necessary scheduling information for the specified
process to the scheduler. The scheduler places the process pid and all its scheduling parameters
into its database and prepares for making the scheduling decision required when the next Schedule
operation is performed. This decision-making process operates continuously, concurrently with the
application processing, preparing its next scheduling decision.

The  Deschedule  primitive  removes  a  process  pid  from  its  queue,  updating  its  current  scheduling 
database  to  prepare  for  the  next  Schedule  operation. 
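
The scheduler interface can be pictured as an object with a private database, as in the following Python sketch. Several things here are assumptions rather than ArchOS behavior: the only scheduling parameter modeled is a deadline passed explicitly to set_schedule_info, the ordering rule (earliest deadline first) is a stand-in for the actual time-driven policy, and the swap list is assumed to be the least urgent processes first.

```python
class Scheduler:
    """Toy scheduler object with its own database (hypothetical policy)."""

    def __init__(self):
        self.db = {}  # pid -> scheduling parameter (here: a deadline)

    def set_schedule_info(self, pid, deadline):
        """Enter the scheduling information for a process."""
        self.db[pid] = deadline

    def deschedule(self, pid):
        """Remove a process from the scheduling database."""
        self.db.pop(pid, None)

    def schedule(self):
        """Return pids in run order; the first is the one to execute now."""
        return sorted(self.db, key=lambda pid: (self.db[pid], pid))

    def request_swap_list(self):
        """Return swap-out candidates, least urgent first (an assumption)."""
        return self.schedule()[::-1]


sched = Scheduler()
sched.set_schedule_info(11, deadline=50)
sched.set_schedule_info(12, deadline=20)
sched.set_schedule_info(13, deadline=35)
run_order = sched.schedule()
swap_order = sched.request_swap_list()
sched.deschedule(12)
after_removal = sched.schedule()
```

As in the text, schedule takes no parameters and decides purely from the database that set_schedule_info and deschedule maintain.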

The Scheduler will use the interrupt mechanism in the processor running the ArchOS kernel to
invoke other kernel mechanisms when it makes relocation decisions or must make decisions
regarding process scheduling with respect to other nodes. In addition, the interrupt mechanism will
be used to interrupt the kernel and application processing when sufficient time has elapsed that the
current scheduling decision must be reconsidered.

3.2.2.5 Address Space Management

The base kernel provides a number of primitive, though powerful, mechanisms for constructing,
manipulating, and accessing the definitions of the virtual address spaces of processes. A process's
virtual address space (or simply address space) is illustrated in Figure 3-3. An address space consists
of four non-overlapping regions:

1.  Kernel  Region: 

The  Kernel  Region  is  identical  for  all  processes,  and  is  only  accessible  when  executing  in 
kernel  mode.  It  contains  all  of  the  kernel  code  and  data  structures.  The  kernel  is  shared 
by  all  processes  and  (at  least  in  the  initial  version  of  ArchOS)  will  be  non-pageable. 

2.  Private  Region: 

The Private Region is unique for each process, and it consists of five segments: Kernel
Stack Segment, User Stack Segment, User Heap Segment, User Data Segment, and User
Text Segment. Each of these segments will be discussed below.

3.  Shared  Region: 

The Shared Region is unique for each arobject, but shared by all processes within an
arobject. This region contains all of the arobject's private abstract data type instances,
along  with  the  code  for  accessing  and  manipulating  them.  It  consists  of  a  variable 
number  of  segments,  one  (or  more)  for  each  instance  of  a  shared  abstract  data  type,  plus 
a  Shared  Text  Segment  and  Shared  Headers  Segment.  Each  of  these  different  types  of 
segments  will  be  discussed  below. 

4.  Kernel  Interface  Region: 

The  Kernel  Interface  Region  is  quite  small,  and  primarily  contains  the  code  and  data 
areas  needed  for  switching  between  user  and  kernel  modes.  This  region  is  shared  by  all 
processes and is non-pageable.


The  Kernel  Region  and  Kernel  Interface  Region  are  of  fixed  size,  and  the  virtual  to  physical  address 
maps  for  these  regions,  as  well  as  their  protection  attributes,  do  not  vary  from  address  space  to 
address  space  (process  to  process),  or  during  the  lifetime  of  an  address  space.  As  a  result,  these  two 
regions  are  constructed  automatically  whenever  a  new  address  space  is  created,  and  never  altered 
thereafter. The remaining portion of an address space consists of the Private Region and Shared
Region. The relative sizes of these two regions can vary from arobject to arobject. However, within
an arobject, all processes share the same Shared Region, and hence all processes within a single
arobject will have identically sized Private and Shared Regions.


The  Private  Region  and  Shared  Region  each  consist  of  a  number  of  (non-overlapping)  segments,  of 
varying  sizes  and  types.  The  different  types  of  segments  have  corresponding  protection  attributes. 
Some  segments  are  required  for  each  address  space,  while  others  are  optional.  Some  segments  can 
be  expanded  (assuming  they  have  space  to  grow),  while  others  have  a  fixed  size  once  allocated.  The 
relative  locations  of  the  various  segments  are  somewhat  constrained,  and  as  a  result,  the  order  in 
which  the  segments  can  be  (dynamically)  allocated  is  similarly  constrained.  Each  allocated  segment 
has  an  associated  page  set,  which  is  used  as  the  paging  area  for  that  segment  (see  Section  3.7  for  a 
description of page sets). The type of page set associated with a segment (temporary, permanent, or
atomic) depends upon the type of the segment.


Private  Region  Segments 


The  Private  Region  contains  five  distinct  segments,  whose  relative  locations  must  be  as  shown  in 
Figure  3-3: 


[Figure 3-3: Virtual Address Space (One per Process). Labels recoverable from the figure: Kernel Region; Private Region (Kernel Stack, User Stack, User Heap, User Data, User Text); Shared Region (Shared Headers, Shared Normal, Shared Atomic, Shared Permanent, Shared Text); Kernel Interface Region.]


Kernel Stack Segment:


The Kernel Stack Segment has a fixed size, which is the same for all address spaces, and is always
located at the top of the Private Region. This stack is only accessible while in kernel mode, and it is
"substituted" for the User Stack (see below) as part of the operation of switching to the kernel. Since
the kernel is not pageable, the Kernel Stack Segment is also not pageable. Like the Kernel Region
and the Kernel Interface Region, the Kernel Stack Segment is constructed automatically whenever a
new address space is created, and never altered thereafter.




User  Stack  Segment: 

The  User  Stack  Segment  is  a  required  segment,  and  is  located  immediately  below  the  Kernel  Stack 
Segment.  It  can  vary  in  size  from  address  space  to  address  space,  and  during  the  lifetime  of  an 
address space. The User Stack Segment is the only segment in the Private Region which grows
downward.  All  other  segments  are  either  fixed  size  or  grow  upward  to  include  larger  virtual 
addresses.  The  User  Stack  Segment  contains  the  subroutine  stack  (including  parameters  and  local 
variables)  while  a  process  is  executing  in  user  mode.  Hence,  this  segment  must  be  both  readable  and 
writable  from  user  mode.  The  User  Stack  Segment  is  pageable  and  should  be  associated  with  a 
temporary  page  set,  which  is  to  be  used  as  the  paging  area. 

User  Text  Segment: 

The  User  Text  Segment  is  also  a  required  segment,  and  it  is  located  at  the  bottom  of  the  Private 
Region.  It  contains  the  code  to  be  executed  by  the  process,  and  hence  its  protection  is  set  to  allow 
only  execute  access  while  in  user  mode.  The  User  Text  Segment  can  vary  in  size  from  address  space 
to address space, but once allocated it never changes. [4] This segment is pageable and should be
associated with the permanent page set which contains the code for the process. Note that since the
User Text Segment is not writeable, only “page-ins” (and no “page-outs”) will ever be required.
Furthermore, in order to reduce primary memory requirements, the Time-Driven Virtual Memory
Subsystem (see Section 3.10) will arrange for User Text pages to be shared among all processes
executing  the  same  code  on  a  single  node. 

User  Data  Segment: 

The  User  Data  Segment  is  not  strictly  required,  but  it  will  almost  always  be  present.  It  is  located 
immediately  above  the  User  Text  Segment,  and  should  be  allocated  after  the  User  Text  Segment  has 
been  specified.  The  User  Data  Segment  contains  the  “global”,  initialized  variables  that  are  used  by 
(and  private  to)  the  process.  It  is  both  readable  and  writable  while  in  user  mode.  The  User  Data 
Segment  can  vary  in  size  from  address  space  to  address  space,  but  once  allocated  it  never  changes. 
[5] This segment is pageable and should be associated with the permanent page set which contains the
initialized data for the process. However, only (initial) page-ins will ever be performed using this page
set, so as not to destroy the definitions of the initial values. All page-outs and subsequent page-ins
will be to the same temporary page set used as the paging area for the User Stack Segment. [6]

[4] Due to limitations in the language compilers and linkers, the User Text Segment may contain the code for the entire
arobject, rather than just the code required by a single process. In that case the User Text Segments for all processes within a
single arobject will have the same size.

[5] As in the case of User Text Segments, limitations in the language compilers and linkers may cause the User Data Segments
for all processes within a single arobject to have the same size.

User  Heap  Segment: 

The  User  Heap  Segment  is  not  strictly  required,  but  it  too  will  almost  always  be  present.  It  is  located 
immediately  above  the  User  Data  Segment,  and  should  be  allocated  after  the  User  Text  Segment  and 
User  Data  Segment  have  been  specified.  The  User  Heap  Segment  contains  the  dynamically  allocated 
variables  that  are  used  by  (and  private  to)  the  process.  It  is  both  readable  and  writable  while  in  user 
mode. The User Heap Segment can vary in size from address space to address space, and during the
lifetime of an address space. Note that since the User Heap Segment grows upward while the User
Stack Segment grows downward, expansion of these two segments causes additional space to be
allocated from opposite ends of the (single) free area within the Private Region. The User Heap
Segment is pageable and should be associated with the same temporary page set which is used as
the paging area for the User Stack Segment. [7]
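
The way the heap and stack consume the single free area from opposite ends can be shown with a few lines of Python. This is only a sketch: the class PrivateRegion, its field names, and the page numbers are hypothetical, and the other Private Region segments are left out.

```python
class PrivateRegion:
    """Toy model of the free area between User Heap and User Stack."""

    def __init__(self, heap_top, stack_bottom):
        self.heap_top = heap_top          # first free page above the heap
        self.stack_bottom = stack_bottom  # first page occupied by the stack

    def free_pages(self):
        """Number of unallocated pages between heap and stack."""
        return self.stack_bottom - self.heap_top

    def expand_heap(self, nvp):
        """Grow the heap upward by nvp pages if the free area allows it."""
        if nvp > self.free_pages():
            return False
        self.heap_top += nvp
        return True

    def expand_stack(self, nvp):
        """Grow the stack downward by nvp pages if the free area allows it."""
        if nvp > self.free_pages():
            return False
        self.stack_bottom -= nvp
        return True


region = PrivateRegion(heap_top=100, stack_bottom=200)
heap_ok = region.expand_heap(60)     # heap now ends at page 160
stack_ok = region.expand_stack(30)   # stack now starts at page 170
too_big = region.expand_heap(20)     # only 10 free pages remain
```

Neither segment needs a fixed limit of its own; an expansion fails only when the two segments would meet.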

Shared  Region  Segments 

The  Shared  Region  contains  five  distinct  types  of  segments,  although  there  can  be  multiple 
segments  of  a  particular  type.  Also,  the  placement  of  those  segments  within  the  Shared  Region  is 
somewhat more flexible than the placement of the segments within the Private Region. The general
structure of the Shared Region is illustrated in Figure 3-3. Note that the entire Shared Region is
optional, and will only be present if the arobject definition includes one or more private abstract data
type definitions. Also note that the Shared Region is identical for all of the processes (address
spaces) which belong to a single arobject, and are resident on a particular node. As a result, the
Shared Region should only be modified in a single address space on a given node, in order to have all
of the related address spaces on that node updated simultaneously.

[6] Multiple segments can use the same page set for paging purposes, by mapping the virtual page addresses within the
various segments directly onto the same page numbers within the page set, i.e. virtual page k gets mapped onto page number
k. Since segments do not overlap, the corresponding paging areas will not overlap either. See Section 3.7 for more details on
the use of the page set facilities.

[7] If desired, a separate (temporary) page set could be used, allowing paging to different disks, or even different nodes.

The types of segments that can appear in the Shared Region are the following:

Shared Text Segment:

There is a single Shared Text Segment, and it must be present whenever the Shared Region exists.
It is located at the bottom of the region, and contains the code which defines the permitted operations
on the arobject's private abstract data types. The Shared Text Segment permits only execute access
while  in  user  mode.  However,  it  is  assumed  that  this  code  will  always  be  entered  "indirectly",  by  first 
determining  which  abstract  data  type  instance  is  to  be  operated  upon,  and  then  directly  invoking  the 
requested  operation  with  the  specified  instance  as  a  parameter.  Furthermore,  it  is  assumed  that 
operations will only be directly invoked on instances which are located on the local node. [8] The Shared
Text Segment can vary in size from address space to address space, but for a given arobject, once it
has been allocated it never changes. This segment is pageable and should be associated with the
permanent page set which contains the code for the operations defined on the arobject's private
abstract data types. Note that since the Shared Text Segment is not writeable, only page-ins (and no
page-outs) will ever be required. Furthermore, in order to reduce primary memory requirements, the
Time-Driven Virtual Memory Subsystem will arrange for Shared Text pages to be shared among all
processes  executing  the  same  code  on  a  single  node. 
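
The "indirect" entry into Shared Text code amounts to a lookup followed by a dispatch, which the following Python sketch illustrates. The header layout, the node names, and the fake_rpc stand-in are assumptions for the example; the text specifies only that the instance is identified first and that direct invocation is limited to instances on the local node.

```python
def invoke(headers, local_node, instance_id, operation, rpc):
    """Invoke an operation on an instance: directly if local, else via rpc."""
    header = headers[instance_id]            # find the instance's header
    if header["node"] == local_node:
        return operation(header["data"])     # direct, local invocation
    return rpc(header["node"], instance_id)  # stand-in for a remote call


# Hypothetical header table: one entry per abstract data type instance.
headers = {
    "queue-1": {"node": "n1", "data": [3, 1, 2]},
    "queue-2": {"node": "n2", "data": None},  # instance lives on another node
}


def fake_rpc(node, instance_id):
    """Pretend to forward the invocation to the owning node."""
    return ("forwarded", node, instance_id)


local_result = invoke(headers, "n1", "queue-1", len, fake_rpc)
remote_result = invoke(headers, "n1", "queue-2", len, fake_rpc)
```

The caller never jumps into the operation code without first consulting the header, so the same call site works for local and remote instances.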

Shared Headers Segment:

There  is  a  single  Shared  Headers  Segment,  and  it  too  must  be  present  whenever  the  Shared  Region 
exists.  It  is  located  at  the  top  of  the  region,  and  contains  the  “header"  information  describing  all 
instances  of  private  abstract  data  types  that  have  been  created  within  the  arobject.  This  header 
information  includes  the  node  on  which  each  instance  is  located,  and  the  address(es)  of  the  Shared 
Normal  Segment,  Shared  Permanent  Segment,  and/or  Shared  Atomic  Segment  associated  with  each 
instance.  The  header  information  is  used  whenever  operations  are  to  be  invoked  on  specified 
abstract  data  type  instances,  and  hence  the  Shared  Headers  Segment  allows  read-only  access  while 
in user mode. The Shared Headers Segment can vary in size from address space to address space,
and during the lifetime of an address space. However, within a single arobject, all address spaces will
have identical Shared Headers Segments. The Shared Headers Segment is the only segment in the
Shared  Region  which  grows  downward.  All  other  segments  are  either  fixed  size  or  grow  upward  to 
include  larger  virtual  addresses.  The  Shared  Headers  Segment  is  pageable  and  should  be  associated 
with  a  temporary  page  set,  which  is  to  be  used  as  the  paging  area.  Note,  however,  that  the  Shared 
Headers  pages  are  shared  among  all  of  the  processes  of  a  single  arobject,  which  are  executing  on  a 
particular  node. 

[8] Abstract data type instances can be distributed throughout the nodes of the computer system, although a single instance
will always be completely contained within a single node. The indirect invocation of an operation on an abstract data type
instance involves first using the information in the Shared Headers Segment to determine which node contains the instance in
question. If the instance is on the local node, the operation can be invoked directly. However, if the instance is on a remote
node, a form of remote procedure call (RPC) is used in order to invoke the operation on that remote node. See Section
RPCSEC for more details on the use of the RPC facility to implement distributed private abstract data types.

Shared Normal Segments:

There can be multiple Shared Normal Segments within the Shared Region, one for each abstract
data  type  instance  which  contains  “normal"  shared  data.  These  segments  can  be  located  almost 
anywhere  within  the  Shared  Region,  above  the  Shared  Text  Segment  and  below  the  Shared  Headers 
Segment.  However,  successive  segments  will  normally  be  allocated  immediately  above  the  Shared 
Text  Segment  and  any  other  existing  Shared  Normal,  Shared  Permanent,  and  Shared  Atomic 
Segments.  The  lifetime  of  a  Shared  Normal  Segment  may  be  shorter  than  the  lifetime  of  the  address 
space,  since  abstract  data  type  instances  can  be  created  and  destroyed  dynamically.  Shared  Normal 
Segments  permit  both  read  and  write  access  while  in  user  mode,  since  these  segments  contain  the 
normal  shared  variables  which  define  the  current  states  of  the  abstract  data  type  instances.  In 
addition  to  being  dynamically  created  and  destroyed,  Shared  Normal  Segments  can  be  expanded 
(grown),  to  support  the  dynamic  allocation  of  shared  normal  variables.  Note  that  since  these 
segments  grow  upward  while  the  Shared  Headers  Segment  grows  downward,  the  free  area  near  the 
top  of  the  Shared  Region  (immediately  below  the  Shared  Headers  Segment)  will  be  allocated  from 
opposite  ends,  reducing  the  chances  of  conflict.  Shared  Normal  Segments  are  pageable  and  should 
be associated with temporary page sets, which are to be used as the paging areas. [9] Note that Shared
Normal  pages,  like  Shared  Headers  pages,  are  shared  among  all  of  the  processes  of  a  single  arobject, 
which  are  executing  on  a  particular  node. 

Shared  Permanent  Segments: 

Shared  Permanent  Segments  are  almost  identical  to  Shared  Normal  Segments,  except  that  they 
contain  the  "permanent"  shared  variables  which  define  the  current  states  of  the  abstract  data  type 
instances. Also, each Shared Permanent Segment is associated with a permanent page set, which is
used as both its paging and permanent storage area. [10]

[9] It is possible to use a single temporary page set as the paging area for all Shared Normal Segments, as well as the Shared
Headers Segment. However, the use of separate page sets allows the individual paging areas to be located on the same nodes
as their corresponding abstract data type instances, increasing efficiency. In addition, it makes it easier to "free" Shared
Normal Segments, and to move them around within the Shared Region. (This latter operation may be required in order to avoid
conflicts when expanding existing segments.)

[10] Again, it is possible to use a single permanent page set to hold all of the Shared Permanent Segments, but the same
considerations as for Shared Normal Segments make the use of separate page sets a bit more attractive.

Shared Atomic Segments:

Shared Atomic Segments are identical to Shared Normal and Shared Permanent Segments in most
respects, except that they contain the "atomic" shared variables which define the current states of
the abstract data type instances. Because of this, each Shared Atomic Segment is associated with an
atomic page set, which is used as both its paging and atomic (permanent) storage area. [11] In order to
determine  which  portions  (variables)  within  Shared  Atomic  pages  have  been  modified  by  various 
transactions,  direct  writing  to  the  Shared  Atomic  Segments  is  not  permitted  while  in  user  mode  (these 
segments  are  read-only  in  user  mode).  Instead,  the  special  AtomicCopy  primitive  must  be  used  in 
order  to  modify  any  atomic  variables.  As  each  modification  is  made  to  a  Shared  Atomic  page,  the 
AtomicCopy primitive also records the modification in the associated atomic page set. This permits
the modification to later be committed or aborted under control of the Transaction Management
Subsystem. Note that since each modification is immediately written to the atomic page set, there will
never be any need for the Time-Driven Virtual Memory Subsystem to page-out Shared Atomic pages.
Also, page-in operations only require the Shared Atomic page to be read from the atomic page set,
since the Atomic Page Set Manager will ensure that the page, as read, will reflect all outstanding
(uncommitted) modifications. [12] For more details on the handling of atomic objects and transactions,
see Section 3.5.
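
The interaction between AtomicCopy and the atomic page set can be made concrete with a small Python model. This is a sketch under assumptions, not the ArchOS mechanism: the class AtomicPageSet, the log-of-modifications representation, and the function signature of atomic_copy are hypothetical, and locking is assumed to be handled elsewhere.

```python
class AtomicPageSet:
    """Toy atomic page set: committed pages plus a log of pending writes."""

    def __init__(self, pages):
        self.committed = {p: dict(v) for p, v in pages.items()}
        self.pending = []  # (transaction, page, variable, value), in order

    def record(self, txn, page, var, value):
        """Record one modification made on behalf of a transaction."""
        self.pending.append((txn, page, var, value))

    def read_page(self, page):
        """Page-in: committed contents plus all outstanding modifications."""
        image = dict(self.committed[page])
        for _, p, var, value in self.pending:
            if p == page:
                image[var] = value
        return image

    def commit(self, txn):
        """Make a transaction's modifications permanent."""
        for t, page, var, value in self.pending:
            if t == txn:
                self.committed[page][var] = value
        self.pending = [m for m in self.pending if m[0] != txn]

    def abort(self, txn):
        """Discard a transaction's modifications."""
        self.pending = [m for m in self.pending if m[0] != txn]


def atomic_copy(memory, page_set, txn, page, var, value):
    """Update the in-memory page and record the change in the page set."""
    memory[page][var] = value
    page_set.record(txn, page, var, value)


page_set = AtomicPageSet({0: {"balance": 100}})
memory = {0: page_set.read_page(0)}
atomic_copy(memory, page_set, "T1", 0, "balance", 80)
after_write = page_set.read_page(0)  # reflects the uncommitted change
page_set.abort("T1")
memory[0] = page_set.read_page(0)    # invalidated page is paged in again
```

Because every write reaches the page set immediately, the in-memory page can always be discarded and rebuilt by a page-in, which is why no page-out of atomic pages is needed.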

Address  Space  Management  Primitives 

The  base  kernel  provides  primitives  for  creating  and  destroying  address  spaces,  for  allocating, 
freeing,  growing,  and  moving  segments  within  an  address  space,  and  for  accessing  and  modifying  the 
information  associated  with  each  virtual  page  within  an  address  space  (mapping,  flags,  and  time  of 
last access). [13] The actual address space management primitives provided by the base kernel are the
following: 


[11] In this case too it is possible, though slightly less attractive, to use a single atomic page set to hold all of the Shared
Atomic Segments.

[12] The consistency of the atomic data values accessed and modified by a transaction is not guaranteed by these facilities,
unless the correct locking protocols are followed. It is assumed that the appropriate read and write locks are obtained for each
atomic data item that is to be accessed or modified, respectively. Also note that transaction aborts require notifying the Atomic
Page Set Manager of the abort, and also undoing all of the aborted modifications in all of the affected Shared Atomic pages.
This latter operation is best accomplished by simply flagging all of the affected pages as "invalid", so that the next attempt to
access them will result in a page-in operation. Since the modifications associated with the aborted transaction have been
removed from the atomic page set, the page-in operation will read the properly modified contents of the page.

[13] No facilities are provided for accessing or modifying the protection codes governing access to the various parts of an
address space. These protection codes are set automatically whenever the address space is created, and whenever new
segments are allocated. There should never be any need to modify them explicitly.


val = ASCreate(asid [, nvp-shared])
val = ASDestroy(asid)
val = ASActivate(asid)

vpa = ASAllocate(asid, seg-type, nvp, gpsid [, desired-vpa])

val = ASFree(asid, vpa)

val = ASExpand(asid, vpa, nvp)

val = ASMove(asid, vpa, desired-vpa)

ste = ASGetSTE(asid, vpa)
ste = ASNextSTE(asid, vpa)

val = ASSetMap(asid, vpa, ppa [, nvp])

ppa = ASGetMap(asid, vpa)

val = ASSetFlags(asid, vpa, flags [, nvp])

flags = ASGetFlags(asid, vpa)

val = ASSetTime(asid, vpa, time [, nvp])

time = ASGetTime(asid, vpa)

val = ASSetPTE(asid, vpa, pte [, nvp])

pte = ASGetPTE(asid, vpa)

(gpsid, pnum) = ASGetGPSID(asid, vpa)

(asid, vpa, ppa) = ASFindShared(gpsid [, pnum])


BOOLEAN val  TRUE if the specified operation is completed successfully; otherwise FALSE.

VIRT-PAGE-ADDRESS  vpa,  desired-vpa 

A  virtual  page  address  within  an  address  space. 

SEG-TABLE-ENTRY ste

A  descriptor  for  a  segment  within  an  address  space.  It  includes  the  segment  type 
(seg-type),  location  (vpa),  size  (nvp),  and  associated  page  set  (gpsid). 

PHYS-PAGE-ADDRESS  ppa 

The  physical  page  address  in  primary  memory,  to  which  a  virtual  page  address  is 
mapped. 

PTE-FLAGS  flags  The  set  of  (BOOLEAN)  flags  associated  with  a  particular  page  in  an  address 
space:  USED,  MODIFIED,  VALID,  EXISTS,  and  COPIED. 

VIRT-TIME  time  The  (approximate)  virtual  (CPU)  time  at  which  a  particular  page  in  an  address 
space  was  last  accessed. 

PAGE-TABLE-ENTRY pte

A  descriptor  for  a  particular  page  in  an  address  space.  It  includes  the 
corresponding  physical  page  address  (ppa),  flags,  and  last  access  time  (time). 




GPSID  gpsid  Global  Page  Set  ID,  which  identifies  the  page  set  associated  with  a  segment.  It 
includes  the  ID  of  the  logical  disk  containing  the  page  set,  the  page  set  type 
(TEMPORARY,  PERMANENT,  or  ATOMIC),  and  the  unique  ID  of  this  page  set 
within  the  logical  disk. 


INT pnum  A page number within a page set, corresponding to a virtual page within an
address space segment.

ASID asid  Address Space ID, which identifies the address space to be operated upon. The
corresponding process ID can be easily obtained from an ASID, and vice versa.


INT  nvp-shared  The  number  of  virtual  pages  (size)  of  the  Shared  Region. 


SEGMENT-TYPE  seg-type 

The type of the address space segment to be allocated: USER-STACK, USER-TEXT,
USER-DATA, USER-HEAP, SHARED-TEXT, SHARED-HEADERS, SHARED-NORMAL,
SHARED-PERMANENT, or SHARED-ATOMIC.

INT  nvp  The  number  of  virtual  pages  involved  in  the  operation. 

On Error  Error conditions are indicated by the use of special return values. The details
concerning the precise nature of an error condition are provided in the Kernel
Error Block.


The ASCreate primitive creates a new address space, corresponding to a newly created process. [14]
The address space ID (asid), which is closely related to the process ID (allowing easy translation
back and forth), must be specified as part of the operation. This asid will be used to identify the
address space in all subsequent operations. The size of the Shared Region (nvp-shared) must also
be specified if this is the first address space belonging to the corresponding arobject to be created on
this node. Otherwise it is optional. [15] Since there is only a fixed, maximum amount of buffer space
available for storing address space definitions, the ASCreate primitive can fail if too many address
spaces have already been defined. [16] The ASDestroy primitive can be used to destroy a previously
defined address space.
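
The rules for ASCreate and ASDestroy can be summarized in a short Python sketch. All constants and names here are hypothetical: the table limit, the page counts, and the explicit arobject argument are assumptions for illustration (in ArchOS the arobject is derived from the asid). The sketch shows the fixed-size table, the required nvp-shared on first creation, and how the Private Region size follows from the Shared Region size.

```python
MAX_ADDRESS_SPACES = 2  # hypothetical limit on stored definitions
TOTAL_PAGES = 1024      # hypothetical address space size, in pages
FIXED_PAGES = 64        # hypothetical Kernel + Kernel Interface Region size


class AddressSpaceTable:
    """Toy table of address space definitions on one node."""

    def __init__(self):
        self.spaces = {}       # asid -> region sizes
        self.shared_size = {}  # arobject -> Shared Region size on this node

    def as_create(self, asid, arobject, nvp_shared=None):
        """Create an address space; False on a full table or missing size."""
        if asid in self.spaces or len(self.spaces) >= MAX_ADDRESS_SPACES:
            return False
        if arobject not in self.shared_size:
            if nvp_shared is None:
                return False   # first space of this arobject needs a size
            self.shared_size[arobject] = nvp_shared
        shared = self.shared_size[arobject]
        self.spaces[asid] = {
            "arobject": arobject,
            "shared": shared,
            # The remaining pages all belong to the Private Region.
            "private": TOTAL_PAGES - FIXED_PAGES - shared,
        }
        return True

    def as_destroy(self, asid):
        """Destroy a previously defined address space."""
        return self.spaces.pop(asid, None) is not None


table = AddressSpaceTable()
first = table.as_create(1, "A", nvp_shared=200)
no_size = table.as_create(2, "B")   # first space of B, size missing
second = table.as_create(2, "A")    # size already known for A
full = table.as_create(3, "A")      # table limit reached
```

Destroying an existing address space frees a table slot, after which a further as_create call can succeed.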


[14] ASCreate can also be used to recreate an address space for a process which had been completely "swapped out", but will soon
be required to run again. See Section 3.10 for more details about process swapping.

[15] Since the sizes of the Kernel and Kernel Interface Regions are fixed, specifying the size of the Shared Region will also
determine the size of the Private Region (there are no other regions in an address space). Also, if another address space from
the same arobject already exists on this node, the size of the Shared Region is already known (all address spaces from the
same arobject have identical Shared Regions). The arobject ID can be easily determined from the asid (or its associated
process ID).

[16] One remedy for this is to swap out and then destroy one of the already existing address spaces.




ASActivate is used when switching processes. It makes the specified address space (asid) the
currently active one, i.e. it switches the processor to that address space. Note that since the Kernel
Region is identical for all address spaces, only the Private and Shared Regions are actually affected
by this switch. The processor continues to execute the same code within the kernel as it was prior to
switching address spaces. Depending upon the architecture of the system's memory management
hardware, activating an address space may require the explicit loading of many registers within the
Memory Management Unit (MMU). The management of these MMU registers is solely the
responsibility of the address space management routines, especially ASActivate. Of course, MMU
register management is simplified considerably if the MMU itself handles the loading of its mapping
registers, in the manner of a cache.

ASAllocate allocates a new segment within an existing address space. The type of segment
(seg-type) determines many of its characteristics. In most cases only one segment of a particular type
is permitted within an address space. However, multiple SHARED-NORMAL, SHARED-PERMANENT,
and SHARED-ATOMIC segments are allowed. At the time a segment is allocated, its initial size (nvp)
and associated page set (gpsid) must be specified.[17] In most cases the location of a new segment is
either fixed or can be determined automatically, assuming the various segment types are allocated in
the proper order.[18] However, the exact location for a new segment can be specified (desired-vpa),
whenever necessary.[19] ASAllocate returns the location (vpa) of the newly allocated segment. For all
segments except USER-STACK and SHARED-HEADERS, this is the location of the first (lowest
address) page in the segment. For USER-STACK and SHARED-HEADERS, the returned location is
the last page in the segment. The returned vpa will be used to identify the segment in subsequent
operations. ASAllocate will fail, returning BAD-VPA, if any conflicts are detected, such as attempting
to allocate an already allocated segment type, or overlapping an existing segment.
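The conflict rules for ASAllocate can be illustrated with a minimal Python sketch. It is not the ArchOS implementation: the names Segment, AddressSpace, and as_allocate, the numeric value of BAD_VPA, and the simplification that the caller always supplies desired-vpa as the lowest page of the new segment are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

BAD_VPA = -1  # assumed sentinel; the report names BAD-VPA but not its value

# Segment types that may occur more than once in an address space.
MULTI = {"SHARED-NORMAL", "SHARED-PERMANENT", "SHARED-ATOMIC"}
# Segment types identified by their last page rather than their first.
LAST_PAGE_ID = {"USER-STACK", "SHARED-HEADERS"}


@dataclass
class Segment:
    seg_type: str
    first: int   # lowest virtual page of the segment
    nvp: int     # size in pages
    gpsid: int   # associated page set (assumed to exist already)

    @property
    def last(self):
        return self.first + self.nvp - 1


@dataclass
class AddressSpace:
    asid: int
    segments: list = field(default_factory=list)


def as_allocate(space, seg_type, nvp, gpsid, desired_vpa):
    """Allocate a segment whose lowest page is desired_vpa.

    Returns the vpa that identifies the segment, or BAD_VPA on a conflict.
    """
    if nvp <= 0 or desired_vpa < 0:
        return BAD_VPA
    # Only one segment of most types is permitted.
    if seg_type not in MULTI and any(s.seg_type == seg_type for s in space.segments):
        return BAD_VPA
    new = Segment(seg_type, desired_vpa, nvp, gpsid)
    # Reject any overlap with an existing segment.
    if any(new.first <= s.last and s.first <= new.last for s in space.segments):
        return BAD_VPA
    space.segments.append(new)
    return new.last if seg_type in LAST_PAGE_ID else new.first
```

In this sketch a second USER-TEXT request, or a request that overlaps an existing segment, yields BAD_VPA, while repeated SHARED-NORMAL requests at distinct locations succeed.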

ASFree deletes an entire segment, indicated by vpa, within the specified address space (asid). Only
SHARED-NORMAL, SHARED-PERMANENT, and SHARED-ATOMIC segments can be freed.[20]
ASExpand increases the size of an existing segment by the specified number of pages (nvp). It will
fail if there is insufficient space for the segment to grow by the amount indicated. ASMove changes
the location of an existing segment from vpa to desired-vpa. Only SHARED-NORMAL, SHARED-PERMANENT,
and SHARED-ATOMIC segments can be moved.[21] ASMove will fail if the new location
for the segment would overlap another existing segment.

[17] This implies that the associated page set must already exist.

[18] The only restrictions on the ordering of segment allocations are that USER-HEAP must follow USER-DATA, which in turn
must follow USER-TEXT, and SHARED-NORMAL, SHARED-PERMANENT, or SHARED-ATOMIC segments must be allocated
after the SHARED-TEXT segment.

[19] This should only be necessary when constructing the Shared Region on a new node, so that it matches the Shared
Region for an arobject that already exists on other nodes.

[20] This happens as a result of destroying abstract data type instances.

ASGetSTE returns a descriptor for the segment which contains the specified virtual page (vpa).
This descriptor indicates the segment's type (seg-type), location (vpa), size (nvp), and associated
page set (gpsid). BAD-STE is returned if the specified page is not within one of the existing Private
Region or Shared Region segments.[22] ASNextSTE is similar to ASGetSTE, except that it returns a
descriptor for the next (higher address) segment which follows, but does not contain, the specified
virtual page. Specifying vpa = 0 will cause the descriptor for the first (lowest address) segment in the
Shared Region (SHARED-TEXT) to be returned, or the descriptor for USER-TEXT to be returned if
there is no Shared Region. BAD-STE will be returned if the specified page is within or beyond the last
segment of the Private Region (USER-STACK). ASNextSTE is useful for scanning through all of the
segments (and pages) which constitute an address space.
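The scanning idiom mentioned above can be sketched as follows. This is an illustrative Python model only: the tuple layout (first page, nvp, seg-type), the use of None for BAD-STE, and the assumption that page 0 never belongs to a Private or Shared Region segment (which is what the vpa = 0 convention suggests) are not taken from the report.

```python
from bisect import bisect_right

BAD_STE = None  # assumed stand-in for the BAD-STE return value


def as_get_ste(segments, vpa):
    """Return the segment containing vpa, or BAD_STE.

    segments is a list of (first_vpa, nvp, seg_type) sorted by first_vpa.
    """
    i = bisect_right([s[0] for s in segments], vpa) - 1
    if i >= 0:
        first, nvp, _ = segments[i]
        if first <= vpa < first + nvp:
            return segments[i]
    return BAD_STE


def as_next_ste(segments, vpa):
    """Return the first segment that lies above vpa and does not contain it."""
    for seg in segments:
        if seg[0] > vpa:
            return seg
    return BAD_STE


def scan(segments):
    """Visit every segment in address order, starting from vpa = 0."""
    seg = as_next_ste(segments, 0)
    while seg is not BAD_STE:
        yield seg
        # Continue from the last page of the segment just visited.
        seg = as_next_ste(segments, seg[0] + seg[1] - 1)
```

Starting from vpa = 0 and repeatedly passing the last page of the segment just returned visits each segment exactly once and terminates with BAD_STE after the highest segment.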

ASSetMap sets the virtual to physical mapping for virtual page vpa, to physical page ppa.
Optionally, a range of nvp virtual pages, beginning with vpa, can be mapped to contiguous physical
pages, beginning with ppa. ASGetMap returns the physical page address (ppa) to which the specified
virtual page (vpa) is mapped. BAD-PPA is returned if the mapping has not been previously defined
using ASSetMap (or ASSetPTE). ASSetFlags sets all of the (BOOLEAN) flags associated with the
specified virtual page (vpa). Optionally, the flags for a range of nvp virtual pages, beginning with vpa,
can all be set to the same values (flags). The available flags are:

• USED: The virtual page has been accessed. This flag helps determine which pages
belong to the working set of a process (See Section 3.10).

• MODIFIED: The virtual page has been modified. This flag indicates that the page must be
written to the associated page set before the physical page frame can be reused.

• VALID: The physical page address to which the virtual page is mapped is valid, i.e. the
page is in primary memory.

• EXISTS: The virtual page exists in the associated page set, i.e. the page has been written
some time in the past. This flag helps avoid page-in operations when a newly allocated
virtual page is first accessed.[23]

• COPIED: The virtual page (within the User Data Segment) has been "copied" to the
temporary page set, i.e. the temporary page set contains a newer version of the page than
the permanent (initial data) page set. This flag only applies to the User Data Segment,
and it is used to indicate which page set is to be used when paging-in the virtual page.[24]

[21] Moving of segments can be a useful way to recover from ASExpand failures.

[22] For our purposes here, the Kernel Stack Segment is not considered a part of the Private Region. Only the "USER"
segments are.

ASGetFlags  returns  the  set  of  flags  associated  with  the  specified  virtual  page  (vpa). 

ASSetTime sets the time of last access for the specified virtual page (vpa) to time. The time value is
in virtual (CPU) time units. Optionally, the last access time for a range of nvp virtual pages, beginning
with vpa, can all be set to the same value of time. ASGetTime returns the last access time for the
specified virtual page. NEVER is returned if the page has never been accessed. ASSetPTE is
equivalent to ASSetMap, ASSetFlags, and ASSetTime combined. It sets all three items of the virtual
page descriptor(s) (ppa, flags, and time) to the specified values (pte). Similarly, ASGetPTE is
equivalent to ASGetMap, ASGetFlags, and ASGetTime combined. ASSetPTE and ASGetPTE are
provided as a convenience, for use when entire page descriptors must be modified or retrieved.
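The relationship between the individual Set/Get primitives and the combined PTE primitives can be sketched as below. The class and method names, the dictionary-backed page table, and the numeric values chosen for BAD-PPA and NEVER are assumptions made for illustration; the report does not give the actual descriptor layout.

```python
from dataclasses import dataclass
from enum import Flag, auto


class PageFlags(Flag):
    USED = auto()
    MODIFIED = auto()
    VALID = auto()
    EXISTS = auto()
    COPIED = auto()


BAD_PPA = -1  # assumed sentinel for an undefined mapping
NEVER = -1    # assumed sentinel for "never accessed"


@dataclass
class PTE:
    ppa: int = BAD_PPA
    flags: PageFlags = PageFlags(0)
    time: int = NEVER


class PageTable:
    def __init__(self):
        self.entries = {}  # vpa -> PTE

    def _entry(self, vpa):
        return self.entries.setdefault(vpa, PTE())

    def set_map(self, vpa, ppa, nvp=1):
        # A range maps to contiguous physical pages starting at ppa.
        for i in range(nvp):
            self._entry(vpa + i).ppa = ppa + i

    def set_flags(self, vpa, flags, nvp=1):
        for i in range(nvp):
            self._entry(vpa + i).flags = flags

    def set_time(self, vpa, time, nvp=1):
        for i in range(nvp):
            self._entry(vpa + i).time = time

    def set_pte(self, vpa, pte, nvp=1):
        # Equivalent to set_map, set_flags, and set_time combined.
        self.set_map(vpa, pte.ppa, nvp)
        self.set_flags(vpa, pte.flags, nvp)
        self.set_time(vpa, pte.time, nvp)

    def get_pte(self, vpa):
        e = self.entries.get(vpa, PTE())
        return PTE(e.ppa, e.flags, e.time)
```

A page that has never been described comes back with BAD_PPA, no flags set, and NEVER, which mirrors the error values described for ASGetMap and ASGetTime.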

ASGetGPSID returns the global page set ID (gpsid) and page number (pnum) associated with the
specified virtual page. This is the page set ID and page number to be used when paging-in or
paging-out this virtual page.[25] BAD-GPSID and BAD-PNUM are returned if the specified page is not
contained within any of the existing segments in the Private Region or Shared Region. ASFindShared
searches for the specified page number (pnum) from the given page set (gpsid), to see if it is already
resident in primary memory, and in active use within one of the existing address spaces.[26] If pnum is
not specified, the search is for any active page from the given page set. If the search is successful,
the address space ID (asid) and virtual page address (vpa) of the first encountered matching entry are
returned, along with the corresponding physical page address (ppa). Otherwise BAD-ASID, BAD-VPA,
and BAD-PPA are returned. ASFindShared aids in the handling of "shareable" segments, such
as the User Text Segments and all of the Shared Region segments. In particular, it can help
determine if page-in or page-out operations are actually required.

[23] It isn't incorrect to page in a nonexistent page. It would simply be read as all zeros (see Section 3.7). However, the
EXISTS flag helps avoid the overhead of the (unnecessary) page-in operation. Whether a nonexistent page is initialized to zero
on first access or simply left undefined depends upon the type of segment it belongs to. User Stack Segment pages can be left
undefined, but pages in most other segments should be initialized to zero.

[24] The COPIED flag is essentially another name for the EXISTS flag, as it applies to the User Data Segment. Each User Data
Segment page is known to exist in the permanent (initial data) page set. The only question is whether a newer version also
exists in the temporary page set.

[25] ASGetGPSID takes the COPIED flag into account when determining the page set to be used for pages in the User Data
Segment.

[26] This involves searching through the existing address spaces for any segments having gpsid as the corresponding page
set. The virtual page descriptor for the page corresponding to pnum is then checked to see if it is VALID.

Address  Space  Management  Data  Structures 

The  information  describing  the  existing  address  spaces  is  contained  within  kernel  data  structures 
called  address  space  descriptors.  These  descriptors  are  linked  together  in  groups  according  to  the 
arobjects  to  which  the  address  spaces  belong,  as  shown  in  Figure  3-4.  This  aids  in  the  handling  of 
the  Shared  Regions  of  address  spaces,  especially  their  construction  and  modification,  since  it  allows 
all  of  the  related  address  spaces  to  be  updated  “simultaneously”. 


Figure 3-4: Arobject Address Space Lists
(Surviving label: Arobject List.)


The structure of each address space descriptor is illustrated in Figure 3-5. It basically consists of a
list of Segment Descriptors, which describe each of the segments contained within the address
space. Note that since there can be varying numbers of Shared Normal, Shared Permanent, and
Shared Atomic Segments, each of these types has its own (sub)list of segment descriptors. Any
segment type which is not present in an address space would be indicated by a page count of zero
(nvp = 0). Each segment descriptor includes a pointer (pta) to the page table, which describes the
state of the individual pages of the segment. Note that each address space within an arobject will
actually have its own set of page tables, even for the Shared Region. This is because the Shared
Region can be accessed and used in very different ways by the different processes of an arobject,
and it is important to determine the virtual memory "working sets" on a per process (per address
space) basis (see Section 3.10).

Figure 3-5: Address Space Descriptor
(Fields shown: prev, next, asid; one segment descriptor each for the User Stack, User Heap, User Data,
User Text, Shared Headers, and Shared Text Segments; and lists, linked by next pointers, of Shared Normal,
Shared Permanent, and Shared Atomic Segment Descriptors. Segment descriptors refer to Segment Page Tables.)

The other main data structure used in the management of address spaces is the Shared Page Set
List, illustrated in Figure 3-6. The sole purpose of this structure is to improve the efficiency of the
ASFindShared primitive.[27] Before paging-in or paging-out a potentially shared page, i.e. one from the
User Text Segment or any segment within the Shared Region, the Time-Driven Virtual Memory
Subsystem must first check (using ASFindShared) to see if the required page is already in main
memory and in use by some other process. If so, the paging operation can be avoided. Given the
global page set ID (gpsid) for the potential paging operation, ASFindShared will look for that gpsid in
the Shared Page Set List, and then check each associated Segment Page Table to see if the page in
question is ever listed as "VALID". Thus, the Shared Page Set List contains an entry for every page
set corresponding to a User Text Segment or Shared Region Segment, in any existing address space.
Associated with each entry is a list of all the segments (indicated by their address space IDs, virtual
page addresses, and page table pointers), which share the use of that page set.

[27] Since the Shared Page Set List could be constructed solely from the contents of the Address Space Descriptors, its use is
not strictly required. However, it greatly increases the efficiency of searching for shared pages.


Figure 3-6: Shared Page Set List
(Each list entry shown holds a gpsid and a head pointer.)
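The lookup performed through the Shared Page Set List can be sketched in a few lines of Python. The class name, the representation of a page table as a dictionary from page number to a (valid, ppa) pair, and the rule that the virtual page address is the segment's base vpa plus pnum are illustrative assumptions, not details given in the report.

```python
from collections import defaultdict

# Stands in for the (BAD-ASID, BAD-VPA, BAD-PPA) triple; the real values are not given.
BAD = (None, None, None)


class SharedPageSetList:
    def __init__(self):
        # gpsid -> list of (asid, base vpa of the segment, page table)
        self.entries = defaultdict(list)

    def register(self, gpsid, asid, base_vpa, page_table):
        """Record that a segment in address space asid uses page set gpsid.

        page_table maps pnum -> (valid, ppa).
        """
        self.entries[gpsid].append((asid, base_vpa, page_table))

    def find_shared(self, gpsid, pnum=None):
        """Return (asid, vpa, ppa) of the first VALID match, or BAD.

        With pnum omitted, any valid page of the page set matches.
        """
        for asid, base_vpa, table in self.entries.get(gpsid, []):
            if pnum is None:
                candidates = sorted(table)
            else:
                candidates = [pnum] if pnum in table else []
            for n in candidates:
                valid, ppa = table[n]
                if valid:
                    return asid, base_vpa + n, ppa
        return BAD
```

Because the list is keyed by gpsid, only the segments that actually share the page set are examined, rather than every segment of every address space.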


3.2.2.6 Synchronization Management

The  low-level  synchronization  mechanisms  for  arobjects  and  for  basic  I/O  handling  functions  are 
provided  at  the  base  kernel.  These  primitives  are  designed  as  a  local  synchronization  mechanism,  so 
remote  invocation  is  not  supported  at  this  level.  All  higher-level  synchronization  must  be  performed 
by  using  the  communication  primitives. 


The  following  primitives  are  provided  at  the  base  kernel: 


evtcnt = Sigsend(event-var)
evtcnt = Sigrec(event-var, timeout)
evtcnt = Sigrecall(event-var, timeout)
evtcnt = Sigabort(event-var, abort-code)

EVTCNT evtcnt The counter value of the specified event variable.

EVENT-VAR event-var

The  event  variable  consists  of  a  waiting  queue  of  client  processes  and  an  event 
counter. 

TIMEOUT  timeout  The  timeout  value  should  indicate  the  maximum  execution  time  for  this  signal 
primitive  including  the  waiting  time. 

ABORTCODE  abort-code 

An  integer  value  which  indicates  an  abort  code. 


The Sigsend and Sigrec primitives are basically similar to the V- and P-operations of an integer
semaphore. However, the Sigrec primitive will be timed out if the corresponding signal (e.g., a
hardware interrupt) is not generated. A Sigrecall primitive is similar to Sigrec and is used for receiving
all stored event signals. A Sigabort primitive can be used to unblock the waiting process with an error
condition.
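The semaphore analogy can be made concrete with a small Python sketch built on threading.Condition. It only models the behaviour described above. The class name EventVar, returning None on timeout or abort, returning the number of consumed signals from sigrecall, and treating an abort as sticky (every later receive also fails) are assumptions chosen to keep the example short.

```python
import threading


class EventVar:
    """An event counter plus a queue of waiters (the queue is the Condition's)."""

    def __init__(self):
        self._cond = threading.Condition()
        self.count = 0
        self.abort_code = None

    def _ready(self):
        return self.count > 0 or self.abort_code is not None

    def sigsend(self):
        # V-operation: store one signal and wake one waiter.
        with self._cond:
            self.count += 1
            self._cond.notify()
            return self.count

    def sigrec(self, timeout):
        # P-operation with a timeout (in seconds); consumes one signal.
        with self._cond:
            if not self._cond.wait_for(self._ready, timeout):
                return None  # timed out
            if self.abort_code is not None:
                return None  # unblocked by sigabort
            self.count -= 1
            return self.count

    def sigrecall(self, timeout):
        # Like sigrec, but consumes every stored signal at once.
        with self._cond:
            if not self._cond.wait_for(self._ready, timeout):
                return None
            if self.abort_code is not None:
                return None
            received, self.count = self.count, 0
            return received

    def sigabort(self, abort_code):
        # Unblock all waiters with an error condition.
        with self._cond:
            self.abort_code = abort_code
            self._cond.notify_all()
            return self.count
```

A receiver that calls sigrec with a finite timeout therefore returns either because a signal was stored, because the time limit expired, or because another process issued sigabort.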

3.3  Arobject/Process  Management  Subsystem 

The Arobject/Process Management Subsystem is responsible for providing arobject/process
management facilities in ArchOS. The subsystem consists of an Arobject/Process Manager and its
worker processes on each node. The Arobject/Process Manager provides a system-wide facility in
cooperation with the base kernel. The workers are provided to perform actual work or decision
making among cooperating Arobject/Process Managers in the system.

The Arobject/Process Manager is primarily responsible for the following operations:

•  Creation  and  destruction  of  an  arobject  and  process 

•  Freezing  and  Unfreezing  of  an  arobject’s  or  process's  activities 

•  Binding  and  unbinding  of  reference  names  for  arobjects  and  processes 

•  Allocation  and  deallocation  of  private  data  objects 

•  Internal  access  mechanisms  to  fetch  and  store  any  data  objects  for  an  arobject/process. 

•  Recovery  management  for  atomic  arobjects 

It  should  be  noted  that  many  functions  can  be  invoked  locally  by  the  base  kernel.  This  subsystem  is 
necessary  to  provide  the  service  for  remote  invocations. 

3.3.1  Arobject/Process  Management 

An  Arobject/Process  Manager  exists  on  each  node  and  manages  a  fixed  number  of  workers.  Every 
worker  can  perform  the  following  service  functions: 

• Coordinate with the other Arobject/Process Managers to make the best assignment
decision for creation of a new arobject/process instance.

• Propagate a new reference name to the other Arobject/Process Managers.

The basic components of the arobject/process subsystem are shown in Figure 3-7.

When a new instance of an arobject or process is created, a system-wide unique identifier called an
arobject id (aid) or process id (pid) is created; it is guaranteed to be unique over the lifetime of
the arobject or process. For a new instance of an arobject, an arobject descriptor is created and
registered in the known arobject table in the kernel. For a new process instance, a process descriptor
is also allocated and linked to its arobject descriptor.

Figure 3-7: Components of the Arobject/Process Subsystem
(Surviving label: Reference Name Hashtable (RNH).)


The  arobject  descriptor  contains  the  following  information  to  control  its  arobject  components: 

• Arobject id (aid):

An aid is a fixed-length descriptor consisting of current node id, birth node id, and local
unique id. An aid is used to identify a destination arobject where a request operation will be
invoked, so the current node id and birth node id are included to reduce the cost of searching
and of migration.

•  Parent  aid: 

Its  parent's  aid. 

•  Arobject  status: 

Status  of  the  arobject. 

•  Freeze/Unfreeze  event  variable: 

An  event  variable  to  control . 

•  A  set  of  reference  pointers  to  private  object  descriptors: 

A  private  object  descriptor  contains  the  data  associated  with  the  arobject's  private  object, 
namely  "processes",  “shared  abstract  data  types",  "statistics  data  object",  "message 
queue",  etc. 
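The descriptor fields listed above can be pictured with the following Python sketch. It is only one plausible reading of the text: the class names, the use of (birth node id, local unique id) as the migration-invariant identity, and the lookup order of current node first and birth node second are assumptions, since the report states only that both node ids are included to reduce searching and migration cost.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Aid:
    current_node: int
    birth_node: int
    local_id: int

    def key(self):
        # Assumed identity: the part of the aid that does not change on migration.
        return (self.birth_node, self.local_id)

    def migrated_to(self, node):
        # Only the location hint changes when the arobject moves.
        return Aid(node, self.birth_node, self.local_id)

    def lookup_order(self):
        # Assumed search order: current node first, then the birth node.
        order = [self.current_node]
        if self.birth_node != self.current_node:
            order.append(self.birth_node)
        return order


@dataclass
class ArobjectDescriptor:
    aid: Aid
    parent_aid: Optional[Aid]
    status: str = "active"
    # References to private object descriptors, e.g. "processes", "message queue".
    private_objects: dict = field(default_factory=dict)
```

Under these assumptions a request addressed with a stale current node id could still be redirected by the birth node, without a system-wide search.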

For instance, suppose that arobject A has two processes (INITIAL and P1) and one shared private
object (SO1) and a new instance of A is created at node X. After the INITIAL process is started, it
creates process P1 at node Y. Then, the relationships among the major kernel objects are as shown in
Figure 3-8.

Figure 3-8: An Example of Arobject Descriptors
(Two panels: At Node X, At Node Y.)

At node X, an arobject descriptor is allocated for an instance of arobject A. A process descriptor for
INITIAL and a shared object descriptor for the header segment and for SO1 are linked to the arobject
descriptor. At node Y, there is also an arobject descriptor for A and a process descriptor for P1
which is linked to the arobject descriptor. Although there is no shared private object created at node
Y, a shared object descriptor is also allocated for the header segment and linked to the arobject
descriptor.

It should be noted that the message queue of arobject A is only allocated on node X. Thus, when P1
attempts to accept a request message, a remote operation is invoked to fetch a message from the
message queue at node X. If there is no acceptable request message in the queue, P1 will be blocked
and placed in the waiting processes' queue.

3.3.1.1 Arobject/Process Assignment Policy

When  creation  of  a  new  instance  of  an  arobject  or  process  is  requested  at  a  non-specific  node,  the 
arobject/process  manager  selects  the  best  node  according  to  the  current  "arobject/process 
assignment  policy". 


The assignment policy, like other user definable policies, can be set by a system designer. In order
to reduce system overhead, the policy definition module for the assignment policy is placed in the
Arobject/Process Manager.

Without creating a new policy definition module, the Arobject/Process Manager can provide the
following assignment policies.

• First Fit (FF): The first A/P manager which replies to the creation request will be selected.

• Random Fit (RF): An A/P manager selected at random will be used.

• Best Fit (BF): One of the best matched A/P managers will be selected.

• Best Effort Fit (BEF):
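The first three policies can be sketched as selection functions over the replies received from the A/P managers. Everything concrete here is an assumption: the report does not say what a reply contains or how "best matched" is measured, so the sketch uses hypothetical (manager, load) pairs in arrival order and treats the lowest load as the best match. Best Effort Fit is not described in the text and is therefore not modelled.

```python
import random


def first_fit(replies):
    """Select the manager whose reply arrived first.

    replies is a list of (manager, load) pairs in arrival order.
    """
    return replies[0][0] if replies else None


def random_fit(replies, rng=random):
    """Select a replying manager at random."""
    return rng.choice(replies)[0] if replies else None


def best_fit(replies):
    """Select the best matched manager; here, the one reporting the lowest load."""
    return min(replies, key=lambda r: r[1])[0] if replies else None


POLICIES = {"FF": first_fit, "RF": random_fit, "BF": best_fit}
```

Keeping the policies behind one table of functions corresponds to the idea that the policy is selected inside the Arobject/Process Manager rather than by each caller.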

3.3.1.2 Life cycle of Arobject/Process

The life cycle of an arobject depends upon whether the arobject is atomic or nonatomic. When
an instance of an arobject is instantiated, the arobject instance becomes active. If an arobject is
atomic, it can be inactivated and remain on stable storage. On the other hand, if an arobject is
nonatomic, it cannot be inactivated, and it becomes dead when it is killed. Both atomic and nonatomic
arobjects can be frozen for monitoring or debugging purposes.

The life cycle of a process is similar to that of an arobject. When an instance of a process is created, it
becomes ready and runnable. Once the time-driven scheduler decides to run a process, it becomes
running, and the process may block due to I/O waiting or the scheduler's preemption. Like an arobject, a
process becomes dead when it is killed, and it can also be frozen. The lifetime of a process instance
depends on the lifetime of its arobject. When the arobject is killed, all of its processes will also be
killed by the system. Thus, there will be no frozen processes left in its arobject.
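The two life cycles can be written down as transition tables. This is a hedged sketch: the state and event names follow the text and the labels of Figure 3-9, but the "preempt" event, the choice that an unfrozen process returns to the ready state, and the function name next_state are assumptions.

```python
# (state, event) -> next state
AROBJECT = {
    ("active", "freeze"): "frozen",
    ("frozen", "unfreeze"): "active",
    ("active", "deactivate"): "inactive",   # atomic arobjects only
    ("inactive", "activate"): "active",
}

PROCESS = {
    ("ready", "schedule"): "running",
    ("running", "block"): "waiting",
    ("running", "preempt"): "ready",        # assumed event name
    ("waiting", "wakeup"): "ready",
    ("ready", "freeze"): "frozen",
    ("running", "freeze"): "frozen",
    ("waiting", "freeze"): "frozen",
    ("frozen", "unfreeze"): "ready",        # assumed target state
}


def next_state(table, state, event, atomic=True):
    """Apply one event; raise ValueError if the transition is not allowed."""
    if state == "dead":
        raise ValueError("a dead instance accepts no events")
    if event == "kill":
        return "dead"  # killing is possible from any live state
    if event == "deactivate" and not atomic:
        raise ValueError("only atomic arobjects can be inactivated")
    if (state, event) not in table:
        raise ValueError(f"{event!r} is not allowed in state {state!r}")
    return table[(state, event)]
```

Killing an arobject would then amount to applying "kill" to the arobject and to each of its processes, whatever state they are in, including frozen.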

The life cycle of an arobject and process is depicted in Figure 3-9.

Figure 3-9: Life Cycle of an Arobject and Process
(Arobject Status labels: active, frozen, inactive, unfreeze, activate, deactivate.
Process Status labels: waiting, ready, running, frozen, wakeup/block, schedule, freeze/unfreeze.)

3.3.1.3 ArchOS primitives

The following ArchOS primitives are supported for arobject/process management for a client.


arobject-id = CreateArobject(arobj-name [, init-msg] [, node-id])
process-id = CreateProcess(process-name [, init-msg] [, node-id])
val = KillArobject(aid)
val = KillProcess(pid)

val = GlobalKillArobject(arobj-id, kill-options)
val = GlobalKillProcess(pid, options)

aid = SelfAid()
pid = SelfPid()
paid = ParentAid(aid)
ppid = ParentPid(pid)

val = SetErrorStack(errblock, blocksize)
val = FreezeArobject(arobj-id, options)
val = UnfreezeArobject(arobj-id, options)
val = FreezeProcess(pid, options)
val = UnfreezeProcess(pid, options)

fval = FetchArobjectStatus(arobj-id, dataobj-id, buffer, size)
sval = StoreArobject(arobj-id, dataobj-id, buffer, size)
fval = FetchProcessStatus(pid, dataobj-id, buffer, size)
sval = StoreProcess(pid, dataobj-id, buffer, size)




BOOLEAN val TRUE if the primitive was executed properly; otherwise FALSE.

NODE-ID node-id The node id indicates the actual node which will be stopped.

AID arobj-id The unique arobject id of an arobject instance.

PID pid The process id of the target process.

GKILL-OP kill-options

The options indicate various control options. For example, it can indicate whether
the caller stops every time after killing a single process or not.

FREEZE-OPT options

The options indicate various selectable flags such as a timeout freeze/unfreeze
flag.

INT fval The actual number of bytes which were fetched.

INT sval The actual number of bytes which were stored.

DATAOBJ-ID dataobj-id The dataobj-id indicates the private object or system control status of the
target arobject/process.

BUFFER *buffer A pointer to the buffer area for storing the returned data object value.

INT size The size indicates the buffer size in bytes.


A CreateArobject primitive creates a new instance of an arobject at an arbitrary node or a specified
node. Similarly, a CreateProcess primitive creates a new instance of a process in the arobject. The
selection of a node is made automatically by ArchOS unless overridden by the Create operation.
Upon arobject creation, the arobject's INITIAL process is automatically dispatched. An optional set of
parameters can be passed to the INITIAL process when the arobject is instantiated by using an initial
message (i.e., "init-msg").

The KillArobject and KillProcess primitives remove an arobject and a process instance, respectively. An
arobject may be killed only by one of its own processes (suicide allowed, no murder). In order to kill
another arobject, the target arobject must have an appropriate operation defined within its
specification so it can kill itself. A process can be killed only by a process which exists in the same
arobject instance.

The SelfAid primitive returns the requestor's arobject id and the ParentAid primitive returns the
parent's arobject id of the specified arobject. The SelfPid primitive returns the process id (pid) of the
requestor and the ParentPid primitive returns the parent's pid of the specified process.

A SetErrorBlock primitive sets an error block in a process's address space. A user error block
consists of a head pointer and a circular queue. The head pointer contains a pointer to an entry
which contains the latest error information in the circular queue. After the execution of this primitive,
a client can access the detailed error information from the specified error block. A FreezeArobject
primitive stops the execution of an arobject (i.e., all of its processes), and a FreezeProcess primitive
halts a specific process for inspection. The UnfreezeArobject and UnfreezeProcess primitives resume
a suspended arobject and process, respectively. While a process is in a frozen state, many of the
factors used for making scheduling decisions can be selectively ignored. For instance, a timeout
value will be ignored by specifying a proper flag in the Freeze primitive.
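The user error block described above, a head pointer plus a circular queue, can be sketched as follows. The class name, the method names, and the convention that the head index moves forward by one slot for each new error are assumptions; the report specifies only that the head pointer designates the entry holding the latest error information.

```python
class ErrorBlock:
    """A fixed-size circular queue of error records with a head pointer."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.head = None  # index of the latest entry; None while empty

    def record(self, info):
        # Advance the head and overwrite the oldest slot when the queue is full.
        if self.head is None:
            self.head = 0
        else:
            self.head = (self.head + 1) % len(self.entries)
        self.entries[self.head] = info

    def latest(self):
        return None if self.head is None else self.entries[self.head]

    def history(self):
        """Return the stored records, newest first."""
        if self.head is None:
            return []
        n = len(self.entries)
        out = []
        for k in range(n):
            entry = self.entries[(self.head - k) % n]
            if entry is None:
                break
            out.append(entry)
        return out
```

With a capacity of three, recording a fourth error overwrites the oldest one, so a client reading the block always finds the most recent errors without any allocation on the error path.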

A  Fetch  primitive  inspects  the  status  of  a  running  or  frozen  arobject  or  process  in  terms  of  a  set  of 
frozen  values  of  private  data  objects.  The  specific  state  of  the  arobject  or  process  will  be  selected  by 
a  data  object  id.  The  state  includes  not  only  the  status  of  private  variables,  but  also  includes  process 
control  information. 

A GlobalKill primitive can destroy an arbitrary arobject or process in the system.

3.3.1.4 Creation and Destruction of Arobject/Process

A client can create a new instance of an arobject at any node in the system by issuing the
CreateArobject primitive. The first argument, arobj-name, contains three values: the first value
contains a file name for the image of the shared region and the second value points to a file for the
image of the private region of its INITIAL process. The last value contains a table of entry points for
light-weight processes. Note that the light-weight process is available for kernel arobjects.

For instance, arobject name A may contain {"/usr/test/a.arobj", "/usr/test/p0.proc", nil}.
When a client executes "CreateArobject(A, msg, Y)", the base kernel first determines whether the
target node is local or remote. Since this is a remote invocation request, it looks up the A/P
manager's reference name hash-table and determines the destination A/P manager's AID. Then, the
client sets up a request message for the remote A/P manager and issues a proper invocation request.

When the remote Communication manager's Netin worker receives the request packet, it places it in
the target A/P manager's request queue. If one of its workers is already waiting on the incoming
invocation request, the request message will be placed directly into the worker's message buffer.




Once the worker executes the CreateArobject primitive, the virtual address space for arobject A is
created by reading the "/usr/test/a.arobj" and "/usr/test/p0.proc" files. A new arobject descriptor is
also allocated and its AID is returned to the worker. Then, the worker returns the result message to its
A/P manager and the A/P manager forwards the result to the caller's A/P manager. Then, the original
caller receives the result from the local A/P manager.

The sequence of interaction between two A/P managers is depicted in Figure 3-10.

Figure 3-10: Creation and Destruction of an Arobject and Process


3.3.2  Name  Management 

An  arobject/process  manager  maintains  binding  information  in  a  hash  table.  When  a  reference 
name  is  bound  to  a  caller,  the  caller’s  arobject/process  manager  registers  its  name  first.  Then,  the 
manager’s  worker  must  propagate  its  entry  across  the  system  using  one-to-many  communication  (i.e., 
using  a  RequestAll  primitive).  The  lifetime  of  a  binding  is  the  same  as  the  lifetime  of  an  arobject  or 
process  instance. 


To find one or all arobject/process instances from a reference name, the arobject/process
manager searches its own hash table; if there is no entry there, it queries the other
managers.


The  following  ArchOS  primitives  are  supported  for  name  management  for  a  client: 


val = BindArobjectName(aid, arobj-refname)
val = BindProcessName(pid, process-refname)
val = UnbindProcessName(pid, process-refname)
val = UnbindArobjectName(aid, arobj-refname)

aid = FindAid(arobj-refname [, preference])
pid = FindPid(process-refname [, preference])
aid-list = FindAllAid(arobj-refname [, preference])
pid-list = FindAllPid(process-refname [, preference])


AID  aid 

The  arobject  id. 

PID  pid 

The  process  id. 

AID-LIST  aid-list 

The  list  of  corresponding  aid’s. 

PID-LIST  pid-list 

The  list  of  corresponding  pid’s. 

AROBJ-REFNAME arobj-refname

The reference name of related arobject(s).

PROCESS-REFNAME process-refname

The reference name of related process(es).

PREFERENCE preference

The preference can specify a search domain such as "INTERNAL",
"EXTERNAL", "LOCAL", "REMOTE", "INTERNAL-LOCAL", "INTERNAL-REMOTE",
"EXTERNAL-LOCAL", or "EXTERNAL-REMOTE".


The BindArobjectName and BindProcessName primitives bind the requested instance of an arobject
or process to a reference name. An arobject or a process may have more than one
reference name, and a single reference name may be bound to multiple arobject or process instances. To
cancel a current binding, a process must use the appropriate Unbind primitive.

A FindID primitive (i.e., FindAid or FindPid) returns the unique id (i.e., aid or pid) of the given
arobject or process in a specific search domain. A search domain can be specified with respect to all
of the internal arobjects, external arobjects, a local node, a remote node, or a reasonable combination
of the four. If more than one instance uses the same reference name, the unique id of any one of them
will be returned. A FindAllID primitive (i.e., FindAllAid or FindAllPid), on the other hand, returns all
of the aids or pids which correspond to the given reference name.
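
As an illustration only, the binding and lookup behavior described above might be sketched in Python as follows. The names NameTable and find_with_fallback are hypothetical and are not part of ArchOS; the sketch assumes an in-memory table per manager, omits search-domain preferences, and replaces the RequestAll-based propagation with a simple loop over the other managers' tables.

```python
from collections import defaultdict


class NameTable:
    """Hypothetical per-manager table of reference-name bindings."""

    def __init__(self):
        # reference name -> set of ids (aids or pids) bound to it
        self._bindings = defaultdict(set)

    def bind(self, ident, refname):
        """Bind an id to a reference name; many-to-many bindings are allowed."""
        self._bindings[refname].add(ident)
        return True

    def unbind(self, ident, refname):
        """Cancel one binding; return False if the binding did not exist."""
        ids = self._bindings.get(refname)
        if not ids or ident not in ids:
            return False
        ids.discard(ident)
        if not ids:
            del self._bindings[refname]
        return True

    def find_all(self, refname):
        """Return every id bound to refname (sorted only for determinism)."""
        return sorted(self._bindings.get(refname, ()))

    def find(self, refname):
        """Return any one id bound to refname, or None if there is none."""
        ids = self.find_all(refname)
        return ids[0] if ids else None


def find_with_fallback(local, others, refname):
    """Search the local table first, then inquire of the other managers."""
    hit = local.find(refname)
    if hit is not None:
        return hit
    for table in others:
        hit = table.find(refname)
        if hit is not None:
            return hit
    return None
```

The find method returns the smallest id purely to keep the sketch deterministic; the text only requires that any one of the matching ids be returned.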

3.3.3  Private  Object  Management 

Private  object  management  allows  a  client  process  to  allocate  and  deallocate  an  instance  of  a 
private  abstract  data  type  at  any  node.  An  Arobject/Process  manager  maintains  a  list  of  private 
object  descriptors  under  its  arobject  descriptor  to  keep  the  data  associated  with  it. 

Since  an  object  type  can  be  one  of  Normal,  Permanent,  and  Atomic,  the  private  object  manager 
must  coordinate  with  a  page  set  manager  (see  Section  3.7)  to  allocate  a  proper  type  of  page  set  to 
create  a  new  instantiation  of  the  abstract  data  type. 

If a remote allocation is requested, the creation of a new private abstract data type must be
coordinated between the destination's Arobject/Process manager and the one with the INITIAL
process. Since every process shares all private segments and should have a uniform view of local and
remote instances, the shared header segment must be updated by the Arobject/Process manager at
the INITIAL process's node. When a proper update is done, the updated part of the shared header
segment is propagated to the other Arobject/Process managers.

object-ptr = AllocateObject(type-name, object-type, parameters [, node-id])

val  =  FreeObject(object-ptr) 

val  =  FlushPermanent(object-ptr,  size) 

OBJECT-PTR  object-ptr 

A  pointer  to  the  allocated  private  data  object. 

OBJECT-TYPE  object-type 

The  object-type  indicates  the  name  of  a  private  abstract  data  type. 

BOOLEAN val  TRUE if the object was released successfully; otherwise FALSE.

NODE-ID node-id  Node identification. An actual node may be designated, or a node selection
criterion may be designated (e.g., the current node, any node except the current
node, any node, or a specific node).

INT size  The number of bytes which must be flushed into permanent storage.


An  AllocateObject  primitive  allocates  an  instance  of  a  private  abstract  data  type  at  any  node  and  a 
FreeObject  primitive  deallocates  the  specified  instance.  A  FlushPermanent  primitive  blocks  the  caller 
until  the  specified  data  object  is  saved  in  non-volatile  storage. 

3.3.4  Recovery  Management 

The arobject/process manager is responsible for restarting atomic arobjects which have at least
one private atomic data object in the event of node failures. Since all private atomic objects are kept
on a corresponding atomic page set, the arobject/process manager will coordinate with the page set
subsystem to resume the crashed atomic arobject by recreating its initial process with the pre-crash
image of the private atomic data objects and permanent data objects, if any.

It  should  be  noted  that  it  is  still  an  application  designer’s  responsibility  to  determine  what  recovery 
action  must  take  place  based  on  its  atomic  and  permanent  data  objects. 

3.4  Communication  Subsystem 

The communication subsystem provides intra- and inter-node message communication
mechanisms to support system-wide, location-independent operation invocation for cooperating
arobjects. An invocation request can be initiated by referring to the destination arobject's id and the
operation name from anywhere in the system. Although a single invocation is initiated through the
arobject id, multiple invocations can be initiated by referring to a reference name of arobjects which
offer the same service. In this case, a calling arobject may continue to perform its activity and receive
one or more results by using a GetReply primitive.

The communication subsystem also interacts with the transaction subsystem to coordinate the
necessary transaction management. For instance, a Request primitive is treated as an elementary
transaction consisting of three steps: a sending part of the request, an invocation linking part, and a
receiving part of the request. The linking part links the requestee's Accept, computation, and
Reply. Since all message activities in this transaction are defined as compound transactions and the
invocation linking part is defined as an elementary transaction, it is possible for the requestee's
computation to be a nested elementary or compound transaction of the top-level transaction.




3.4.1  Message  Header  and  Body 

A  message  consists  of  header  and  body  parts.  The  header  part  is  not  writable  from  a  client  and  only 
the  body  part  can  be  set  by  a  client. 

The  message  header  contains  the  following  data: 

•  Transaction id

•  Requestor’s  arobject  id 

•  Destination  arobject  id 

•  Destination  operation  name 

•  Number  of  arguments 

•  Size  of  message  in  bytes 

It  should  be  noted  that  even  though  message  typing  is  not  supported  at  runtime,  type  checking  of 
messages  between  requestor  and  requestee  can  be  done  solely  at  compile  time.  Since  message 
communication  preserves  message  boundaries,  it  does  not  offer  a  "stream-oriented"  communication 
interface  at  this  level. 
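
The header fields listed above might be modeled as in the following Python sketch. This is an assumption-laden illustration, not the ArchOS layout: the field types, the names MessageHeader, Message, and make_message, and the choice to count only the body in the size field are all hypothetical. The header is made immutable to mirror the rule that a client cannot write the header part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageHeader:
    """Read-only header; field types are assumptions for illustration."""
    transaction_id: int
    requestor_aid: int
    destination_aid: int
    operation: str
    num_args: int
    size_bytes: int   # assumed here to be the size of the body only


@dataclass
class Message:
    header: MessageHeader
    body: bytes       # the only part a client may set


def make_message(tid, src_aid, dst_aid, operation, args):
    """Build a message whose header is derived from the arguments (by value)."""
    body = b"".join(args)
    header = MessageHeader(tid, src_aid, dst_aid, operation, len(args), len(body))
    return Message(header, body)
```

Because each message carries its own size, message boundaries are preserved, which is consistent with the absence of a stream-oriented interface at this level.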

3.4.2  Message  Queue 

There are two types of message queues associated with arobject and process instances. A
request-queue is allocated for an instance of an arobject. When a new arobject instance is created,
the arobject/process manager creates a request-queue associated with its arobject descriptor. That
is, the body of the message queue is kept in the kernel at the running node of the initial process. A
reply-queue is allocated for an instance of a process. When a process issues an invocation request to
an arobject, its results will be placed in the reply-queue.

In  both  message  queues,  each  entry  is  represented  by  a  message  descriptor  and  can  be  checked 
without  accepting  the  message  itself. 

3.4.3  Communication  Manager 

When an arobject invocation request is issued at a caller's site, the communication manager sets up
a proper header part for the message packet. The message is then sent to the destination arobject. If
the destination arobject is in a remote node, the remote invocation protocol is used and the
communication manager becomes the monitor for the protocol. Similarly, when a process accesses
shared private data at a remote node, a remote procedure call is used and the communication
manager executes its protocol.

3.4.3.1 Components of the Communication Manager
Each Communication Manager works with a pair of network I/O workers called "Netin" and
"NetOut", and a group of "Stub" workers.


Netin and NetOut workers are responsible for receiving and transmitting message packets between two
nodes. A Stub worker is used to perform a remote invocation request for a shared private data
object. When a remote invocation is received by the Communication Manager, it assigns the actual
work to the stub worker. Then, the stub worker calls the target procedure with the arguments and
returns the result packet to the caller. A sequence of remote invocation is described in Section .

3.4.3.2 ArchOS primitives

The  following  ArchOS  primitives  are  supported  for  communication  management  for  a  client. 


trans-id = Request(arobj-id, opr, msg, reply-msg)
trans-id = RequestSingle(arobj-id, opr, msg)
trans-id = RequestAll(arobject-name, opr, msg)
pid = GetReply(trans-id, reply-msg)

(trans-id, requestor, opr) = AcceptAny(acc-opr, msg)

(trans-id, opr) = Accept(requestor, acc-opr, msg)
trans-id = Reply(pid, req-trans-id, reply-msg)

ptr-mds = CheckMessageQ(qtype, selector, selector-id)

val = CaptureCommArobject(arobj-id, commtype, requestor, req-opr)
val = CaptureCommProcess(pid, req-opr)

val = WatchCommArobject(arobj-id, commtype, requestor, req-opr)
val = WatchCommProcess(pid, req-opr)


TRANSACTION-ID trans-id

The transaction id of the transaction on whose behalf the request is being made.

AID arobj-id  The unique id of the receiving arobject.

OPR-SELECTOR opr

The name of the operation to be performed.

MESSAGE *msg  A pointer to the message which contains the parameters of the operation to be
performed. The message to the destination arobject must not contain any
pointers (i.e., call-by-value semantics must be used).




REPLY-MSG *reply-msg

A  pointer  to  the  reply  message. 

AROBJ-REFNAME  arobj-refname 

The  reference  name  of  the  receiving  arobject(s). 

OPR-SELECTOR acc-opr, req-opr

The  name  of  operation  to  be  performed.  The  "opr"  parameter  can  be  a  specific 
operation  name  or  "ANYOPR". 

TRANSACTION-ID req-trans-id

The transaction id of the transaction on whose behalf the request is made.

MSG-DESCRIPTORS *ptr-mds

Pointer to a list of the message descriptors selected by the specified selection
criteria.

MSG-Q  qtype,  commtype 

This  indicates  either  "request-"  or  "reply-"  message  queue. 

BOOLEAN  val  TRUE  if  the  primitive  was  executed  properly;  otherwise  FALSE. 

AID  requestor  The  aid  of  the  communicating  arobject. 


The Request primitive provides remote procedure call semantics in which the requesting process
invokes an operation by sending a message and blocks until the receiving arobject returns a reply
message. Likewise, if the receiving arobject is the same arobject, a local operation will be invoked.

The RequestSingle and RequestAll primitives send a request message and proceed without
waiting for a reply message. The RequestSingle primitive provides nonblocking one-to-one
communication, and the RequestAll primitive supports one-to-many communication. The requesting
process may thus invoke an operation on more than one instance of an arobject or process with one
request. To receive all of the replies, the GetReply primitive may be repeated until a reply with a null
body is received.

The GetReply primitive receives a reply message which has the specific transaction id generated by
the preceding RequestAll primitive. If the specific reply message is not available, the caller will
be blocked until the message becomes available.
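
The reply-collection loop described above can be sketched as follows. This is a toy model in Python, not the ArchOS interface: ReplyQueue, deliver, and collect_replies are hypothetical names, the null body is represented by None, and the toy raises an exception where a real GetReply would block the caller.

```python
from collections import deque


class ReplyQueue:
    """Toy stand-in for a process's reply-queue (not the kernel object)."""

    def __init__(self):
        self._replies = {}   # trans-id -> queue of (pid, body) pairs

    def deliver(self, trans_id, sender_pid, body):
        """Record a reply for the given transaction id."""
        self._replies.setdefault(trans_id, deque()).append((sender_pid, body))

    def get_reply(self, trans_id):
        """Return the next (pid, body) reply carrying trans_id."""
        queue = self._replies.get(trans_id)
        if not queue:
            # A real GetReply would block here until a reply arrives.
            raise LookupError("no reply available yet")
        return queue.popleft()


def collect_replies(reply_queue, trans_id):
    """Repeat get_reply until a reply with a null (None) body is received."""
    results = []
    while True:
        pid, body = reply_queue.get_reply(trans_id)
        if body is None:
            break
        results.append((pid, body))
    return results
```

A caller would obtain trans_id from a RequestAll-style call, continue with its own work, and later call collect_replies to gather whatever the receiving instances returned.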

The process responsible for an arobject operation receives a message using the Accept primitive.
Using the selection criteria specified, the operating system selects an eligible message from the
arobject's input queue and returns it. The process operates on the message, responding with a reply
when processing has been completed. If no suitable message is in the request message queue, then
the caller will block until such a message becomes available.

The  AcceptAny  primitive  can  receive  a  message  from  any  arobject  instance  with  any  operation  (i.e., 
"ANYOPR")  or  a  specified  operation  request.  The  primitive  can  return  the  requestor's  transaction  id, 
specified  operator,  and  requestor’s  aid.  The  Accept  primitive  can  receive  a  message  from  a  specific 
requestor  arobject  and  returns  the  requestor’s  transaction  id  and  the  requested  operator. 

The CheckMessageQ primitive examines the current status of an incoming message queue without
blocking the caller process. The primitive must specify a message queue type, either
"request-queue" or "reply-queue". The request-queue holds all of the non-accepted request messages and
is allocated for each arobject instance. The reply-queue maintains all of the non-read reply messages
and is assigned to every process instance. A message can be selected based on the sender's
arobject id, operation name, and/or transaction id. If more than one argument is given, only
messages which satisfy all of the conditions will be returned. If no corresponding message exists in a
specified message queue, a "NULL-POINTER" will be returned.
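
The selection rule above (all given conditions must hold, and nothing is removed from the queue) might look like the following Python sketch. The MessageDescriptor fields and the keyword-argument interface of check_message_q are assumptions made for illustration; the actual primitive takes (qtype, selector, selector-id), and None stands in for the "NULL-POINTER" result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageDescriptor:
    """Hypothetical descriptor for one queued message."""
    sender_aid: int
    operation: str
    trans_id: int


def check_message_q(queue, sender_aid=None, operation=None, trans_id=None):
    """Return descriptors matching every given criterion, or None if none match.

    The queue itself is left untouched, so no message is accepted or consumed.
    """
    matches = [
        d for d in queue
        if (sender_aid is None or d.sender_aid == sender_aid)
        and (operation is None or d.operation == operation)
        and (trans_id is None or d.trans_id == trans_id)
    ]
    return matches or None
```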

3.4.4  Remote  Invocation  Protocol 

A  remote  invocation  protocol  is  used  to  control  the  invocation  of  an  operation  on  a  remote  arobject. 

In a simple remote invocation case, two packets will be exchanged between two nodes: one is a
request packet and the other is a reply packet. A simplified version of the protocol sequence is shown in
Figure 3-11.

When a remote request primitive is issued by a client, the base kernel passes the request to the
communication manager. Then, the communication manager places the request into an outgoing
packet queue and gives it to a "NetOut" worker. The NetOut worker simply initiates the actual output
activity over the net. Note that the NetOut worker can be replaced by a simple procedure call within
the communication manager if the context-switching overhead is significant.

When a remote site receives the request packet, the "Netin" worker of the communication manager
places the packet in its incoming packet queue and notifies the communication manager. If the
destination arobject is ready to accept the request, the communication manager moves the actual
message to the destination arobject's receiving buffer. Otherwise, the communication manager
places the message in the destination arobject's request queue.
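
The delivery decision made at the receiving site can be sketched in Python as below. ArobjectEndpoint and deliver are hypothetical names, and the "ready to accept" condition is modeled, as an assumption, by a flag that is set while a process is blocked in an Accept call with an empty receiving buffer.

```python
from collections import deque


class ArobjectEndpoint:
    """Toy model of a destination arobject's receiving state."""

    def __init__(self):
        self.waiting = False          # True while a process is blocked in Accept
        self.receive_buffer = None    # filled directly when the arobject is ready
        self.request_queue = deque()  # holds requests that are not yet accepted


def deliver(endpoint, message):
    """Place a request in the buffer if the arobject is ready, else queue it."""
    if endpoint.waiting and endpoint.receive_buffer is None:
        endpoint.receive_buffer = message
        endpoint.waiting = False
        return "buffer"
    endpoint.request_queue.append(message)
    return "queue"
```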


3.4.5  RPC  for  Shared  Private  Objects 


When a client invokes an operation on a shared private object that exists on a remote node, the
communication manager's stub worker performs the actual invocation and returns the result to the
caller.

First, the private object invocation request is checked by looking at the shared object table in
the shared header segment. If the target object is in a remote node, a remote invocation request,
InvokePrivate, will be sent from the communication manager.

When the remote "Netin" worker receives the packet, it checks the availability of the stub workers. If
one of the stub workers is idle, it assigns the actual invocation request to that stub worker. When the stub
worker completes the invocation request, it returns the result message to the communication
manager. Then, the communication manager forwards the result packet to the caller.

Figure 3-12: Interaction Sequence for a Remote Invocation Request

The basic sequence within a remote communication manager is shown in Figure 3-13.

3.5  Transaction  Subsystem 

The transaction subsystem manages two types of transactions: compound transactions and
elementary transactions. The transaction subsystem allows a client to nest compound and
elementary transactions in any combination. By using nested elementary transactions, a
client can use a traditional "nested transactions" mechanism [Moss 81]. A compound transaction
can be used on a "compensatable" atomic arobject [Sha 85, Tokuda 85]. Thus, the transaction
subsystem must provide uniform control mechanisms to perform a "commit" operation for both
types, an "undo" operation for elementary transactions, and a "compensation" operation for
compound transactions.


It should be noted that the current transaction subsystem does not provide any mechanisms which
support a "cooperating" transaction.


Figure 3-13: Remote Procedure Call Sequence at a Remote Communication Manager
[Figure labels: Netin, Stub Worker, Arobject B, Private Object; At Node X, At Node Y]


3.5.1  Transaction  Types,  Scopes,  and  Tree 

•  Transaction Types:

A transaction type indicates whether the current transaction is an elementary or a
compound transaction, and whether or not it is a top-level transaction.

•  Transaction Scope:

A transaction scope is created when a new transaction construct is used in a process.
Within this scope, a client can access atomic objects as if these computational steps
were executed alone.

•  Transaction Tree:

A transaction tree is a structure that can be used to trace the dynamic behavior of a set of
transactions. The transaction tree also provides information related to the commit
protocol and lock propagation among the nested transactions.

•  Transaction id:

The transaction subsystem gets a unique transaction id from the base kernel when a new
transaction is initiated; the id is guaranteed to be unique over the lifetime of the transaction. A
transaction id is a fixed-length descriptor consisting of a sequence of the parent's node id,
the current node id, and a local unique id.
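
A fixed-length transaction id of this shape could be encoded as in the following Python sketch. The field widths are an assumption (three unsigned 32-bit integers in network byte order); the report does not state the actual sizes, and pack_tid and unpack_tid are hypothetical helper names.

```python
import struct

# Assumed layout: parent's node id, current node id, local unique id,
# each an unsigned 32-bit integer in network byte order (12 bytes total).
_TID_FORMAT = "!III"


def pack_tid(parent_node, current_node, local_id):
    """Encode the three components into a fixed-length descriptor."""
    return struct.pack(_TID_FORMAT, parent_node, current_node, local_id)


def unpack_tid(raw):
    """Decode a descriptor back into (parent_node, current_node, local_id)."""
    return struct.unpack(_TID_FORMAT, raw)
```

Since the descriptor has a fixed length, two ids can be compared byte for byte, and the node components can be read without consulting any other node.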




3.5.2  Transaction  Management 

The transaction manager is responsible for maintaining the consistency of the atomic arobjects
even in the case of a node failure. The major activities of the transaction manager are
"BeginTransaction", "Commit", and "Abort" transaction processing.

To commit a transaction, the transaction manager coordinates with the related page set subsystem
and performs an atomic update of all atomic objects the transaction has visited. The actual
coordination is based on the 3-phase commit protocol [Bernstein 83] to improve its
reliability.

To  abort  a  transaction,  the  transaction  manager  must  also  coordinate  with  the  page  set  subsystem, 
but  it  performs  "undo"  for  elementary  transactions  and  initiates  the  necessary  "compensation 
operations"  for  all  pre-committed  compound  transactions. 

3.5.2.1 Components of the Transaction Manager

Each Transaction Manager works with a group of workers, called "Coordinator" and
"Subordinator". A "Coordinator" is assigned to keep track of the activity of a top-level transaction.
When the top-level transaction is committed, the "Coordinator" worker is responsible for performing the
3-phase commit protocol among the related subtransactions. A "Subordinator" is used to keep track
of the activity of a nested transaction.

When a new transaction is created, a transaction descriptor is created in the Transaction Manager
and is used to maintain the current status of the transaction and its relationship to its child transactions.

3.5.2.2 ArchOS primitives

The  following  ArchOS  primitives  are  supported  for  transaction  management  for  a  client. 

tid = BeginTransaction(trantype, timeout)

sval = CommitTransaction(tid)

sval = AbortTransaction(tid)

sval = AbortIncompleteTransaction(req-tid)

tid = SelfTid()

ptid = ParentTid(tid)

trantype = TransactionType(tid)

val = IsCommitted(tid)

val = IsAborted(tid)

TIME timeout

The timeout value indicates the maximum lifetime of this elementary transaction.

TRANTYPE trantype

The type of the given transaction, such as "CT", "ET", "Nested CT", or "Nested
ET".

TID tid

The id of the specific transaction.

TID ptid

The parent's tid of the given transaction "tid".

INT sval

Returns 1 if the requested action was performed on the specified transaction
successfully; otherwise returns an error status code.

BOOLEAN val

TRUE if the requested predicate holds; otherwise FALSE.


The BeginTransaction primitive creates a transaction descriptor for the requested transaction. The
CommitTransaction primitive commits the specified transaction.

The AbortTransaction primitive aborts the specified transaction and all of its child transactions
within the same transaction tree. If the transaction that invokes the AbortTransaction primitive does
not belong to the same transaction tree as the transaction which is to be aborted, the client cannot
abort that transaction. This primitive executes all of the necessary "undo" or "compensate" actions,
based on the transaction type, and breaks the current transaction scope. After completion of these
actions, the states of all affected atomic objects will be consistent and will be either "identical"
to, or "a member of the equivalence class" of, their initial (pre-execution) states. The
AbortIncompleteTransaction primitive aborts all of the outstanding incomplete transactions
which had been initiated by an outstanding RequestSingle or RequestAll primitive. In other words, all
of the nested transactions which belong to the specified request transaction but have not yet
completed (committed) will be aborted.

The SelfTid primitive returns the id of the current transaction and the ParentTid primitive returns the
parent transaction id of the given transaction id. The TransactionType primitive returns the type of
the given transaction (compound or elementary) and also indicates the transaction level. An
IsCommitted primitive checks whether or not the given transaction is already committed. An IsAborted
primitive checks whether or not the given transaction is already aborted.

3.5.3  Three-Phase  Commit  Protocol 

When a top-level transaction is created, its node's transaction manager becomes the primary
coordinator for performing the 3-phase commit protocol, and the other transaction managers where at
least one nested transaction was executed become subordinators. These subordinator transaction
managers are also used as backups for the primary manager.

The current 3-phase commit protocol [Bernstein 83] can be summarized as follows:


1. The transaction manager, say TM, which is responsible for a top-level transaction, T_top,
invokes "prepare" operations at the transaction managers of all visited nodes by using a
RequestAll primitive. Then, it waits for all transaction managers to acknowledge the
"prepare" operation.

2. TM invokes "precommit" operations on all the other transaction managers which involve
T_top. When the other transaction managers initiate the "precommit" operation, they add
a new entry, T_top, to their copy of the "commit list". Then, they wait for all
acknowledgements for "precommit" to come back.

3. TM performs "commit" operations at all related transaction managers.
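
The three steps above can be illustrated with the following Python sketch of a coordinator driving its subordinates. It is a simplified, single-process model under stated assumptions: Participant and three_phase_commit are hypothetical names, messages are replaced by direct method calls, and the abort-on-refusal handling is an assumption because the summary above does not describe failure cases, timeouts, or coordinator backup.

```python
class Participant:
    """Toy subordinate transaction manager (hypothetical, not ArchOS code)."""

    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.commit_list = []   # transactions that reached "precommit"
        self.committed = []
        self.aborted = []

    def prepare(self, tid):
        return self.vote_yes    # acknowledgement of "prepare"

    def precommit(self, tid):
        self.commit_list.append(tid)
        return True             # acknowledgement of "precommit"

    def commit(self, tid):
        self.committed.append(tid)

    def abort(self, tid):
        self.aborted.append(tid)


def three_phase_commit(tid, participants):
    """Run prepare, precommit, and commit in order; abort if any step is refused."""
    # Phase 1: ask every participant (as RequestAll would) and collect all acks.
    votes = [p.prepare(tid) for p in participants]
    if not all(votes):
        for p in participants:
            p.abort(tid)        # assumed failure handling, not from the report
        return False
    # Phase 2: every participant records the transaction in its commit list.
    acks = [p.precommit(tid) for p in participants]
    if not all(acks):
        for p in participants:
            p.abort(tid)
        return False
    # Phase 3: the actual commit at all related participants.
    for p in participants:
        p.commit(tid)
    return True
```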


3.5.4  Compensation  Action  Management 

A transaction manager must initiate a compensation action whenever a compound transaction is
aborted and must guarantee that the effects of all committed actions are cancelled out. In general,
the order in which compensation operations are applied affects whether the local and global
"equivalence relations" are satisfied; thus the transaction manager must maintain a clear ordering rule.
In the current algorithm, we take the reverse ordering of the "committed" sequence of operations.

The bookkeeping is performed by using a "compensation log" at each arobject. When an arobject
invokes a compensatable operation on another arobject as a nested transaction, a "compensation
log" record is added to the caller's log. The record consists of the current transaction id (i.e., the
caller's transaction id), the arobject name, the operation name, its parameters, the compensation
operation's name, and the necessary parameters for the compensation operation. When the caller's
transaction is aborted, the system first checks the "cancellation relation" between the current
operation and the previous operations by getting the information from the arobject. The system can
then determine the necessary compensation operations by scanning the compensation log records
from the end to the beginning, and invoke them.
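
The reverse scan of the compensation log might be sketched in Python as follows. CompensationRecord and compensations_for are hypothetical names, the record fields follow the list given above, and the "cancellation relation" check is deliberately omitted, so this shows only the ordering rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompensationRecord:
    """One log entry; field names are assumptions based on the text."""
    trans_id: str
    arobject: str
    operation: str
    params: tuple
    comp_operation: str
    comp_params: tuple


def compensations_for(log, trans_id):
    """List (arobject, compensation operation, parameters) for an aborted
    transaction, scanning the log from the end to the beginning."""
    return [
        (rec.arobject, rec.comp_operation, rec.comp_params)
        for rec in reversed(log)
        if rec.trans_id == trans_id
    ]
```

Invoking the returned operations in list order applies the compensations in the reverse of the order in which the original operations were committed.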

3.5.5  Lock  Management 

Proper lock management is necessary to provide a consistent view of atomic data objects across the
transaction tree.

When a nested elementary transaction commits, all of the locks that it held are passed to the
transaction in which it is nested (its parent in the transaction tree); when a nested compound
transaction commits, all of its locks are released. The rules that determine which locks a transaction
may obtain are more involved. Two transactions are said to be unrelated if: (1) they are not contained
in a single transaction tree, or (2) they are in a single transaction tree and are concurrently executing
siblings or descendants of concurrently executing siblings, or (3) they are in a single transaction tree
in which one is an ancestor of the other and either the descendant is a compound transaction or
there is a compound transaction on the path connecting the two transaction nodes in the
corresponding transaction tree. (Note that according to this definition, a compound child transaction
is always unrelated to its parent transaction.)

Two unrelated transactions may compete for locks, and they may hold locks with compatible lock
modes for a single data object at any given time. However, if they request incompatible lock modes
for a single data object, then one of the competitors will obtain a lock and the other will either block
until it can receive the desired lock or return to the requestor with an appropriate status indication.

Two transactions are said to be related if they are contained in a single transaction tree where one
is the descendant of the other, the descendant is an elementary transaction, and there are no
compound transactions in the path connecting their respective nodes in the transaction tree. The
lock compatibility rules for related transactions are different from those for unrelated transactions. In
this case, the descendant transaction can obtain any lock mode for any lock held by a related
ancestor in the transaction tree, including incompatible lock modes that would not be allowed if the
transactions were unrelated. (Of course, the descendant transaction will have to compete with all of
the unrelated transactions in the system to successfully obtain the requested lock with the desired
mode.)
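
The definition of related transactions can be checked mechanically, as in the Python sketch below. Txn, path_to_ancestor, and is_related are hypothetical names. One point of interpretation is an assumption: the "path connecting" the two nodes is taken to cover the descendant and the nodes strictly between it and the ancestor, so the ancestor's own type is not examined.

```python
class Txn:
    """Node of a transaction tree; kind is "ET" (elementary) or "CT" (compound)."""

    def __init__(self, name, kind, parent=None):
        self.name = name
        self.kind = kind
        self.parent = parent


def path_to_ancestor(descendant, ancestor):
    """Nodes from descendant up to, but excluding, ancestor.

    Returns None when ancestor is not actually above descendant in the tree.
    """
    path = []
    node = descendant
    while node is not None and node is not ancestor:
        path.append(node)
        node = node.parent
    return path if node is ancestor else None


def is_related(a, b):
    """True if one is an ancestor of the other and the descendant side of the
    path (descendant plus intermediate nodes) contains only elementary
    transactions."""
    if a is b:
        return False
    for descendant, ancestor in ((a, b), (b, a)):
        path = path_to_ancestor(descendant, ancestor)
        if path is not None:
            return all(node.kind == "ET" for node in path)
    return False
```

Under this reading, a compound child is never related to its parent, which agrees with the note in the definition of unrelated transactions.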

The  following  ArchOS  primitives  are  supported  for  lock  management  for  a  client: 

newlock-id = CreateLock([parent-lockid])
val = DeleteLock(lockid)

sval = SetLock(lock-type, lockid, lock-mode)
tval = TestandSetLock(lock-type, lockid, lock-mode)
tval = TestLock(lock-type, lockid, lock-mode)
rval = ReleaseLock(lock-type, lockid, lock-mode)

INT sval  1 if the specified lock is set; 0 if the lock is not set. A negative value will be
returned if an error occurred.

INT tval  1 if the specified lock is being held; 0 if the lock is not being held. A negative
value will be returned if an error occurred.

INT rval  1 if the specified lock is released; 0 if the lock is not released. A negative value
will be returned if an error occurred.

LOCK-TYPE lock-type

The lock type can be either "TREE" or "DISCRETE".

LOCK-ID lockid  The lockid indicates the unique id of a lock.

LOCK-MODE lock-mode

The lock mode can be "READ", "WRITE", etc.


The SetLock primitive sets a "tree-type" or "discrete-type" lock on arbitrary objects by specifying a
lock key and its mode. If a requested lock is being held, the caller will block until it is released. The
TestandSetLock primitive also tries to set a lock; however, it will return "FALSE" if the lock is being
held. If the requested lock is a tree-type lock, then the SetLock and TestandSetLock primitives may also
fail due to a violation of the tree-lock convention (see Section 3.5.5).

The  TestLock  primitive  checks  the  availability  of  a  specified  lock  with  a  lock  mode.  In  the  case  of  a 
tree  lock,  it  also  checks  whether  the  locking  would  be  legal  in  the  corresponding  lock  tree.  The 
ReleaseLock primitive explicitly releases a lock on an object that was acquired by the SetLock or
TestandSetLock primitive.
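
The non-blocking lock primitives can be illustrated with a small sketch. This is not the ArchOS implementation: the class name LockTable, the use of Python booleans as return values, and the compatibility rule (only READ may be shared with READ) are assumptions made for illustration. Lock types, the tree-lock legality check, and the blocking SetLock are left out.

```python
class LockTable:
    """Toy model of TestLock, TestandSetLock and ReleaseLock (hypothetical names)."""

    def __init__(self):
        # lockid -> list of (holder, mode) pairs currently held
        self.held = {}

    def test_lock(self, lockid, mode):
        """Return True if `mode` could be granted right now.

        Assumed rule: READ can be shared with other READ holders;
        any other mode needs the lock to be completely free.
        """
        holders = self.held.get(lockid, [])
        return all(m == "READ" for _, m in holders) and (mode == "READ" or not holders)

    def test_and_set_lock(self, holder, lockid, mode):
        """Set the lock if available; return False instead of blocking."""
        if not self.test_lock(lockid, mode):
            return False
        self.held.setdefault(lockid, []).append((holder, mode))
        return True

    def release_lock(self, holder, lockid, mode):
        """Release a lock previously set by `holder`; False if it was not held."""
        holders = self.held.get(lockid, [])
        if (holder, mode) not in holders:
            return False
        holders.remove((holder, mode))
        return True
```

In this sketch a second reader is admitted while a writer is refused, which mirrors the "return FALSE if the lock is being held" behavior described for TestandSetLock.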

3.5.6  Recovery  Management 

Since automatic recovery of application programs is not an easy task for the transaction subsystem,
the subsystem guarantees only the consistency of atomic arobjects and provides a handle to
distinguish the pre-crash and post-crash situations. Each arobject designer must define a proper
action for recovery and let the "INITIAL" process handle the detailed recovery sequence.

In ArchOS, each arobject has an atomic variable, called the "restart-counter", which is
incremented whenever its host machine restarts. By using the restart counter, the INITIAL process
can determine whether the program is running before or after a crash.
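
A minimal sketch of how an INITIAL process might branch on the restart counter is given here. It assumes that the counter is zero when the arobject is first created, which the text does not state; the function and parameter names are hypothetical.

```python
def initial_process(restart_counter, first_start, recover):
    """Dispatch on the arobject's restart counter.

    Assumption: the counter is 0 until the host machine has restarted
    at least once, so any non-zero value means a post-crash start.
    """
    if restart_counter == 0:
        return first_start()
    # Designer-supplied recovery action; it receives the number of restarts.
    return recover(restart_counter)
```

The recovery action itself remains the responsibility of the arobject designer, as described above.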


To  clean  up  and  retrieve  the  necessary  atomic  arobjects,  the  transaction  subsystem  must 
coordinate  with  the  page  set  manager.  In  particular,  the  atomic  page  set  manager  can  recover  the 
necessary page set for a given atomic arobject.




3.6  File  Management  Subsystem 

The  ArchOS  File  Management  Subsystem  provides  a  system-wide,  location  independent  file  access 
service.  It  provides  three  types  of  files:  normal,  permanent,  and  atomic.  Normal  files  are  temporary, 
and not recoverable following a crash. Permanent files are saved on disk, and most closely resemble
files on other systems. Atomic files are guaranteed to be failure atomic under "soft and clean"
failures [Bernstein 83]. All three types of files are client level arobjects when active (i.e. open). When
inactive,  permanent  and  atomic  files  are  saved  using  page  sets  (See  Section  3.7). 

The  implementation  of  files  as  regular  client  level  arobjects  achieves  two  main  advantages.  First, 
the  full  power  of  the  ArchOS  transaction  facilities  is  available,  permitting  the  arbitrary  nesting  of 
atomic file operations within other nested transactions. Second, the file arobjects are treated in the
same  manner  as  other  arobjects  by  the  Time-driven  Scheduler  and  the  Time-driven  Virtual  Memory 
Manager.  This  ensures  that  critical  files  (data)  will  be  available  (scheduled  and  paged  in)  when  they 
are  to  be  accessed  by  critical  processes,  thus  reducing  the  number  of  missed  deadlines. 

The  File  Subsystem  uses  logical,  location  independent  names  for  files.  Hence,  files  are  not 
constrained  to  reside  on  particular  disk  volumes.  They  can  be  dynamically  moved  to  other  disks  by 
the  system,  if  desired,  for  improved  global  disk  usage.  The  file  name  space  is  flat  (not  hierarchical), 
but  facilities  are  provided  to  allow  many  of  the  benefits  of  hierarchical  directories.  File  names  are 
expected  to  be  relatively  long  strings,  composed  of  several  shorter  strings  (called  components), 
separated by the special character '/' (slash). An initial substring of a name (up to a slash) is known
as a prefix.
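
The notion of a prefix can be made concrete with a short sketch. It assumes absolute names beginning with a slash and follows the form of the prefixes shown in Figure 3-14 (no trailing slash, with "/" as the shortest prefix); the function name is hypothetical.

```python
def name_prefixes(filename):
    """List the prefixes of a slash-separated file name, shortest first.

    Assumption: names are absolute, and a prefix is written without a
    trailing slash, except for the root prefix "/".
    """
    components = filename.strip("/").split("/")
    result = ["/"]
    # Every proper initial run of components forms one more prefix.
    for i in range(1, len(components)):
        result.append("/" + "/".join(components[:i]))
    return result
```

For a name such as /usr/tmp/report, the sketch yields the prefixes /, /usr and /usr/tmp.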

The system-wide file system directory is partitioned, and each partition is saved in a known
location (refer to Figure 3-14). For example, the part of the directory containing all file names with a
particular prefix is saved on one logical disk. This style of implementation was dictated by efficiency
considerations,  such  as  in  creating  and  locating  files.  Facilities  are  provided  for  moving  parts  of  the 
directory  from  one  disk  to  another,  but  this  is  expected  to  occur  infrequently. 

The  functionality  of  the  File  Management  Subsystem  is  provided  by  four  separate  but  co-operating 
entities (refer to Figure 3-15).

1.  Each  client  arobject  is  linked  with  a  standard  file  system  interface.  This  interface  consists 
of  a  library  of  routines,  which  hide  the  implementation  details  of  the  other  subsystems 
within  the  File  Management  Subsystem. 

2.  Each  open  (active)  file  is  represented  by  an  instance  of  a  user  level  file  type  arobject. 




Prefix Map Table (held by Prefix Map Manager 1 on Node 1 and Prefix Map Manager 2 on Node 2):

Prefix      Logical Disk    Directory Manager
/           1               1
/bin        1               1
/etc        2               2
/usr/tmp    3               3

Directory prefixes on each logical disk (Directory Managers 1 to 4):

Logical Disk 1:  /, /tmp, /bin
Logical Disk 2:  /etc, /usr
Logical Disk 3:  /usr/src, /usr/tmp, /usr/bin
Logical Disk 4:  /usr/sys, /usr/lib, /usr/John

Figure 3-14: The Partitioned File Subsystem Directory Structure


3. Each node in the system contains a kernel level Prefix Map Management Arobject, which
is responsible for mapping file name prefixes to the logical disks on which their directory
entries reside.

4. Each logical disk in the system containing a portion of the directory has an associated
kernel level Directory Management Arobject.

The  above  four  entities  will  be  described  in  some  detail  in  the  following  sections. 

3.6.1  Client  Arobject’s  File  System  Interface 
A  client  arobject  will  always  interact  with  the  file  system  through  the  use  of  a  standard  set  of  library 
routines.  These  routines  are  together  referred  to  as  the  Client  Arobject’s  File  System  Interface  (or  the 
CA  Interface).  The  primary  purpose  of  the  CA  Interface  is  to  transparently  interact  with  the  three 
other  subsystems  of  the  file  system  (File  Arobjects,  Prefix  Map  Manager,  and  Directory  Manager)  in 
supporting  client  level  functionality.  Thus,  it  hides  the  details  of  the  interactions  with  the  other  parts 
of  the  File  Subsystem,  and  provides  a  cleaner  interface  than  the  raw  system  primitives. 


A second aspect of the CA Interface's functionality is that it provides appropriate internal buffering
to  improve  efficiency.  For  open  files,  it  also  keeps  track  of  the  sizes  of  the  files. 


CLIENT LEVEL:   Client Arobject
                File Arobjects (one per open file)

KERNEL LEVEL:   Prefix Map Manager (one per node)
                Directory Managers (one per disk with directory on it)

Figure 3-15: Subsystems within the File Management Subsystem


The following list of primitives is intended to illustrate the functionality of the ArchOS standard file
I/O library; it is not an exhaustive list. Many other primitives can be added to make file management
more  convenient  for  the  client. 




fid = CreateFile(filename, filetype, mode[, node])
val = DeleteFile(filename)
fid = OpenFile(filename, mode)
val = CloseFile(fid)
val = RenameFile(oldname, newname)
val = SyncDirectory(filename)

val = SetPrefix(prefix)
prefix = GetPrefix()

nread = ReadFile(fid, buffer, length)
nwritten = WriteFile(fid, buffer, length)
nwritten = ZeroFile(fid, length)
val = SeekFile(fid, byteoffset, origin)
val = SyncFile(fid)

val = StatusFile(fid, statusbuffer)
val = InfoFile(filename, infobuffer)

INT fid               The ID for the file.

BOOLEAN val           TRUE if the specified operation is done successfully; otherwise FALSE.

FILENAME prefix       The filename prefix currently in use, which will be added to any filename tails
                      being specified.

INT nread             The number of bytes actually read.

INT nwritten          The number of bytes actually written.

FILENAME filename     The name of the file for which the specified operation is to be performed.

FILETYPE filetype     The type of the file to be created: NORMAL, PERMANENT, or ATOMIC.

MODE mode             One of four possible modes in which the file can be opened: READ, WRITE,
                      READ-ONLY, and EXCLUSIVE-WRITE.

NODENAME node         The ID of the preferred node on which the file should be created.

FILENAME oldname, newname
                      The old and new names (respectively) of the file being renamed.

BUFFER *buffer        The address for the data buffer.

INT length            The number of bytes to be read or written.




INT byteoffset        The position in number of bytes from the origin.

FILEORIGIN origin     Starting position in the file, which can have one of three values: START-OF-FILE,
                      END-OF-FILE, and CURRENT-POSITION.

FSTATUSBUFFER *statusbuffer
                      The buffer address for returning dynamic file status information.

FINFOBUFFER *infobuffer
                      The buffer address for returning static file information.


The  CreateFile  primitive  creates  a  file  of  the  specified  type  (NORMAL,  PERMANENT,  or  ATOMIC)  on 
the  preferred  computing  node.  If  a  node  preference  is  not  specified,  the  file  is  created  on  any  node. 
The  newly  created  file  is  then  opened  in  the  specified  mode.  The  DeleteFile  primitive  deletes  the 
specified file. OpenFile opens the specified file in one of four modes: READ, WRITE, READ-ONLY,
and EXCLUSIVE-WRITE. The CloseFile primitive closes the specified file.[29] RenameFile changes the
name  of  the  file  from  the  old  name  to  the  new  name  specified.  The  SyncDirectory  primitive  is  used  for 
saving  any  buffered  directory  information  for  the  specified  file  on  disk. 

The SetPrefix primitive provides a shorthand technique for giving filenames. A file prefix can be
specified, which is automatically concatenated with filename tails to form the complete file name. The
GetPrefix primitive allows the client to obtain the current value of the file prefix.

The  ReadFile  primitive  is  used  for  reading  the  specified  number  of  bytes  from  a  file,  starting  at  the 
current  position.  The  bytes  read  are  returned  in  a  buffer  in  the  client’s  address  space.  Similarly,  the 
WriteFile  primitive  writes  the  specified  number  of  bytes  from  the  buffer,  at  the  current  position  in  the 
file. The ZeroFile primitive is used for writing the specified number of zero bytes, starting at the
current  position  in  the  file.  The  ability  to  zero  a  file  is  useful  when  truncating  files,  and  creating  sparse 
files.  The  SeekFile  primitive  allows  random  access  to  a  particular  position  in  the  file.  The  position  can 
be  specified  as  a  byte  offset  from  the  start  or  end  of  the  file,  or  from  the  current  position  in  the  file. 
The  SyncFile  primitive  flushes  the  contents  of  a  file  from  buffers  in  memory  onto  disk. 
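
The three origins accepted by SeekFile can be sketched as a simple position calculation. This is only an illustration: the function name, the treatment of byteoffset as a signed value added to the origin, and the decision to allow positions beyond the end of the file (as sparse files would need) are assumptions not confirmed by the text.

```python
START_OF_FILE = "START-OF-FILE"
END_OF_FILE = "END-OF-FILE"
CURRENT_POSITION = "CURRENT-POSITION"


def resolve_seek(origin, byteoffset, current, filesize):
    """Return the new absolute byte position for a seek request."""
    base = {
        START_OF_FILE: 0,
        END_OF_FILE: filesize,
        CURRENT_POSITION: current,
    }[origin]
    position = base + byteoffset
    if position < 0:
        # A position before the first byte is rejected in this sketch.
        raise ValueError("seek before start of file")
    return position
```

Because the CA Interface keeps track of the current position and file size of open files, a calculation of this kind could be done on the client side without contacting the file arobject.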

The StatusFile primitive is used for obtaining dynamic status information of a file in the buffer
provided. Information such as the current mode of file access, and the number of active and
outstanding open requests on the file is provided. The InfoFile primitive, on the other hand, provides
static  file  information,  such  as  creation  date,  last  modification  date,  length  of  file,  etc. 

[29] All open files for a client are registered with the kernel by the CA Interface. Hence, if an arobject with open files is
destroyed, the kernel can close all these open files. The occurrence of each of the three primitives, CreateFile, OpenFile, and
CloseFile, is reported to the kernel.




3.6.2  File  Arobjects 

Each  open  file  in  the  system  is  represented  by  a  client  level  file  arobject.  There  are  three  different 
types  of  file  arobjects,  corresponding  to  the  three  different  types  of  files:  normal,  permanent,  and 
atomic.  A  file  arobject  maintains  the  "file  buffer",  and  provides  byte  level  file  I/O.  The  file  is  mapped 
onto  the  virtual  address  space  of  the  file  arobject,  and  the  buffer  is  maintained  automatically  by  the 
Virtual  Memory  Subsystem.  The  two  main  functions  of  a  file  arobject  are:  (1)  maintaining  locks,  and 
thus ensuring consistent concurrent access to the file, and (2) handling read and write operations on
the  file.  Locks  are  set  on  the  entire  file.  When  the  file  is  opened,  the  type  of  access  is  specified,  and 
the  appropriate  lock  is  set.  The  lock  is  released  only  when  the  file  is  closed.  In  addition  to  lock 
management,  reading  and  writing,  a  few  other  primitives  such  as  FASync  and  FAStatus  are  also 
provided. 


filesize = FAOpen(mode)
val = FAClose()

nbr = FARead(location, nbytes, buffer)
nbw = FAWrite(location, nbytes, buffer)
nbw = FAZero(location, nbytes)

val = FASync()

val = FAStatus(statusbuffer)

val = FARestart(fdm-aid, filesize)


INT filesize          The size of the file in bytes.

BOOLEAN val           TRUE if the specified operation is done successfully; otherwise FALSE.

INT nbr               The actual number of bytes read.

INT nbw               The actual number of bytes written.

MODE mode             One of four possible modes in which the file can be opened: READ, WRITE,
                      READ-ONLY, and EXCLUSIVE-WRITE.

INT location          The location (in bytes from the beginning of the file) within the file at which
                      reading or writing starts.

INT nbytes            The number of bytes to be read or written.

BUFFER *buffer        The address for the data buffer.

FASTATUSBUFFER *statusbuffer
                      The buffer address for returning status information.




AID  fdm-aid  The  ID  of  the  Directory  Management  arobject,  to  be  used  when  closing  the  file. 


The FAOpen primitive allows a file to be opened in one of four possible modes: "READ", "WRITE",
"READ-ONLY", and "EXCLUSIVE-WRITE". These modes apply to all three types of files, but the
behavior of a mode does depend on the type of file. Once the file is opened in a particular mode, the
appropriate type of lock is set on it to ensure consistency. The size of the file is returned to the client
arobject. The file size, as well as the current position pointer, is maintained by the Client Arobject
Interface. The FAClose primitive closes the file. It also deactivates the file arobject if there are no
other  outstanding  "opens"  on  the  file. 

The FARead and FAWrite operations allow multiple bytes to be read or written at a time. The current
position at which reading or writing should start, and the number of bytes to be read or written are
provided, along with a pointer to the data buffer.[30] The actual number of bytes read or written is
returned  by  the  operation.  The  FAZero  primitive  allows  a  number  of  bytes  inside  of  the  file  to  be 
zeroed  out.  Thus,  it  is  possible  to  truncate  a  file,  or  maintain  a  sparse  file  in  this  system. 

The  FASync  operation  flushes  the  contents  of  the  file  buffer  onto  the  disk,  so  that  any  buffered 
information is synchronized with the copy of the file stored on disk. The FAStatus primitive returns
dynamic  file  information  in  the  statusbuffer  provided.  Information  is  provided  about  the  mode  of  file 
access,  the  lock  compatibility  of  open  files,  and  the  number  of  active  and  outstanding  open  requests. 

The  FARestart  operation  is  used  for  initialization  purposes  when  the  file  arobject  is  first  created 
(which can happen when the file is created, or first opened). It is invoked by an instance of the
Directory  Management  arobject,  which  provides  its  own  ID  and  the  current  size  of  the  file  as 
parameters.  The  ID  of  the  Directory  Management  arobject  is  used  by  the  file  arobject  when  the  file  is 
closed  by  any  client. 

3.6.2.1 Components of a File Arobject

The  components  of  a  File  Arobject  are  described  below  and  shown  in  Figure  3-16.  Each  file 
arobject  has  a  single  process  called  the  FA  Manager  Process,  which  receives  all  requests  for 
operations,  carries  out  the  appropriate  operations,  and  returns  the  results.  It  operates  on  three  main 
data  structures:  (1)  the  Open  Client  List  (OCL),  (2)  the  Request  Client  List  (RCL),  and  (3)  the  file 
buffer. The Open Client List maintains a list of the clients which have successfully opened this file,


[30] If the client is remote with respect to the file arobject, the communication mechanism automatically handles data transfer
to the remote buffer.




and  the  mode  of  access  for  each  client.  Since  locking  is  done  on  a  per  file  basis,  all  these  clients 
must  have  compatible  locking  on  the  file.  The  Request  Client  List  is  a  list  of  the  clients  waiting  for 
their  open  operations  to  be  completed.  The  information  maintained  for  each  request  is  a  triple 
consisting  of  the  request  transaction  ID,  the  ID  of  the  requesting  client  arobject,  and  the  mode  of 
access  requested. 


Open Client List:      entries of the form (requestor, mode)
Request Client List:   entries of the form (req-tid, requestor, mode)
Other fields:          "modified" flag, filesize, Count of Chunks

Figure 3-16: Components of a File Arobject


The file buffer is a part of the file arobject's address space. It is maintained in a number of fairly
large chunks (32 KBytes per chunk [31]). A list of pointers to these chunks and a count of the number of
chunks  are  also  maintained.  The  advantage  of  large  chunks  is  that  a  long  list  of  chunk  pointers  is  not 
needed,  and  hence  searching  for  a  particular  page  in  the  file  can  be  more  efficient. 
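
The arithmetic implied by fixed 32 KByte chunks can be sketched as follows. The function names are hypothetical, and the sketch assumes that chunks are laid out in file order, so that the n-th pointer in the list refers to the n-th 32 KByte run of the file.

```python
CHUNK_SIZE = 32 * 1024  # 32 KBytes per chunk, as stated in the text


def locate(byte_offset):
    """Map a byte offset in the file to (chunk index, offset within chunk)."""
    return divmod(byte_offset, CHUNK_SIZE)


def chunks_needed(filesize):
    """Number of chunk pointers required to cover a file of `filesize` bytes."""
    return -(-filesize // CHUNK_SIZE)  # ceiling division
```

With this layout, byte 70000 of a file falls in the third chunk (index 2), so a single index calculation replaces any search through the pointer list.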


[31] The size of the chunks is determined by the architecture of the SUN workstations, the current target machines for ArchOS
implementation.




In  addition  to  the  three  main  data  structures,  a  few  other  pieces  of  information  are  also  maintained. 
Some  of  these  are  the  ID  for  the  Directory  Management  arobject,  a  flag  which  indicates  whether  the 
file has been modified, and a variable which saves the current size of the file.[32]

3.6.2.2 Normal and Permanent File Manipulations

Opening of files is allowed to proceed on an FCFS basis [33], but multiple compatible opens are
permitted. When an FAOpen request is received by the FA Manager process, the Open Client List is
first  checked  to  see  whether  there  are  other  ongoing  requests.  If  the  OCL  is  empty,  the  incoming 
request  is  placed  on  that  list.  If  the  Request  Client  List  has  some  waiting  clients,  then  the  new  request 
is  added  to  that  list.  If  the  RCL  is  empty,  and  the  mode  of  the  incoming  request  is  compatible  with  the 
modes  of  the  ongoing  requests  in  the  OCL,  the  new  request  is  added  to  the  OCL,  but  if  the  mode  is 
incompatible,  then  it  is  added  to  the  RCL.  Once  the  incoming  request  has  been  added  to  the  OCL, 
the  FAOpen  is  complete,  and  the  client  is  signalled.  If  the  incoming  request  is  added  to  the  RCL,  the 
FAOpen  is  suspended,  pending  lock  availability. 
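
The FAOpen decision procedure described above can be sketched as follows. The class and method names are hypothetical, and the mode compatibility rule is supplied by the caller, because the text does not give the compatibility matrix for the four modes.

```python
from collections import deque


class FileArobject:
    """Toy model of the OCL/RCL handling in FAOpen (not the ArchOS code)."""

    def __init__(self, compatible):
        self.compatible = compatible  # assumed predicate: (mode_a, mode_b) -> bool
        self.ocl = []                 # Open Client List: (requestor, mode)
        self.rcl = deque()            # Request Client List: (req_tid, requestor, mode)

    def fa_open(self, req_tid, requestor, mode):
        """Return True if the open completes now, False if it is suspended."""
        if not self.ocl:
            grant = True              # no ongoing opens: admit immediately
        elif self.rcl:
            grant = False             # others already waiting: keep FCFS order
        else:
            grant = all(self.compatible(mode, m) for _, m in self.ocl)
        if grant:
            self.ocl.append((requestor, mode))
        else:
            self.rcl.append((req_tid, requestor, mode))
        return grant
```

Note that a request compatible with the current holders is still queued when the RCL is non-empty; this is what preserves first-come, first-served ordering in the description above.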

When  the  FAClose  operation  is  requested,  the  OCL  is  first  checked  for  the  requesting  client,  and  the 
corresponding entry is removed. Next, the FDClose() operation is invoked for the appropriate
Directory  Management  (FDM)  arobject.  The  FDM  arobject  maintains  a  count  of  outstanding  open 
requests,  and  informs  the  File  Arobject  whether  it  should  continue,  deactivate,  or  kill  itself.  If  the  FA 
has  to  continue,  it  moves  all  of  the  compatible  entries  from  the  head  of  the  RCL  to  the  OCL,  and 
notifies  the  corresponding  clients.  If  there  are  no  outstanding  open  requests,  the  FA  deactivates 
itself.  If  the  file  has  been  deleted,  and  the  last  FAClose  operation  has  completed,  the  FA  kills  itself. 
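
The "continue" case of FAClose, in which waiting clients are promoted from the head of the RCL, can be sketched as follows. The function name and the caller-supplied compatibility predicate are assumptions; the FDClose() call and the deactivate and kill cases are omitted.

```python
from collections import deque


def fa_close(ocl, rcl, requestor, compatible):
    """Remove `requestor` from the OCL and promote waiting clients.

    ocl is a list of (requestor, mode); rcl is a deque of
    (req_tid, requestor, mode). Returns the list of clients to notify,
    or None if `requestor` did not have the file open.
    """
    for i, (client, _) in enumerate(ocl):
        if client == requestor:
            del ocl[i]
            break
    else:
        return None
    promoted = []
    # Move entries from the head of the RCL while they are compatible
    # with everything now in the OCL; stop at the first conflict.
    while rcl and all(compatible(rcl[0][2], m) for _, m in ocl):
        _, client, mode = rcl.popleft()
        ocl.append((client, mode))
        promoted.append(client)
    return promoted
```

Stopping at the first incompatible entry, rather than scanning the whole RCL, keeps later compatible requests from overtaking an earlier waiting client.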

The  reading  and  writing  of  normal  and  permanent  files  is  very  straightforward.  The  system  "copy" 
mechanism  carries  out  the  transfer  of  data  between  the  FA  address  space  and  the  client’s  address 
space.  Reading  and  writing  to  disk  is  automatically  handled  by  the  virtual  memory  manager.  Each 
time  the  FAWrite  and  FAZero  operations  are  invoked,  the  file  arobject  updates  its  notion  of  the  current 
file  size. 


[32] The size of the file saved here is updated whenever FAWrite and FAZero operations are invoked. This is not the most
up-to-date value, since the Client Arobject Interface buffers several read and write operations.

[33] The policy for opening files can possibly be set by the user to better suit the constraints of the real time application.




3.6.2.3 Atomic File Manipulations

The  operations  on  atomic  files  are  quite  similar  to  their  counterparts  for  normal  and  permanent  files. 
The  main  difference  is  that  atomicity  of  the  operations  has  to  be  guaranteed  whenever  the  file  is 
manipulated. The FAOpen and FAClose operations are the same as described in the previous
section,  since  these  operations  only  affect  the  OCL  and  RCL  data  structures,  and  do  not  touch  the 
atomic  file  data. 

The  FARead  and  FAWrite  operations  are  implemented  as  elementary  transactions,  so  that  they  can 
be  arbitrarily  nested  inside  of  other  transactions.  The  FARead  operation  invokes  the  Copy 
mechanism  for  copying  the  data  from  the  file  arobject’s  address  space  to  the  client's  address  space. 
For the FAWrite operation, however, the AtomicCopy operation is invoked, which updates the file in
main memory and also propagates the write operation to the Atomic Page Set Subsystem (refer to
Section 3.7.2).

3.6.3  Prefix  Map  Management 

The  directory  of  the  ArchOS  File  Subsystem  is  distributed  across  multiple  nodes  and  multiple  disks 
of  the  system.  Specifically,  all  directory  entries  with  a  particular  prefix  are  on  a  particular  logical  disk. 
Hence,  we  need  to  maintain  a  system  wide  table  which  can  map  the  directory  fragments 
corresponding  to  different  prefixes,  to  the  logical  disks  on  which  these  fragments  reside.  The  main 
function  of  the  File  Prefix  Map  Management  Subsystem  (or  FPMM  subsystem)  is  to  maintain  this 
mapping  in  a  table  known  as  the  Prefix  Map  Table  (or  PMT).  There  is  one  instance  of  the  FPMM 
subsystem  at  the  kernel  level  of  each  node  of  the  distributed  system.  A  copy  of  the  entire  system  wide 
Prefix  Map  Table  is  maintained  by  each  FPMM  instance. 

All  of  the  functionality  provided  by  the  FPMM  subsystem  is  related  to  mapping  file  name  prefixes  to 
IDs of appropriate Directory Management Arobjects, which save the directory entries for those
prefixes. Once the appropriate Directory Management Arobject for a file has been determined, all
future operations on that file are performed either by the File Arobject, or by the Directory Management
Arobject (depending on the operation in question). The FPMM subsystem, on the other hand,
provides operations for accessing and maintaining the Prefix Map Table, e.g. at restart, and when disk
volumes are mounted and unmounted. The ability to create and delete directory fragments by
adding  or  removing  file  prefixes  is  also  provided. 




fdm-aid  =  FPMap(filename) 

val = FPRestart()
val  =  FPMount(diskid) 
val  =  FPUnmount(diskid) 
val  =  FPAssign(prefix,  diskid) 
val  =  FPUnassign(prefix) 

val  =  FPInsertTable(prefix,  diskid,  fdm-aid) 

val  =  FPRemoveTable(prefix) 

val  =  FPRequestTable(tablebuffer) 


AID fdm-aid           The arobject ID of the Directory Management subsystem.

BOOLEAN val           TRUE if the specified operation is done successfully; otherwise FALSE.

FILENAME filename     The name of the file for which the specified operation is to be performed.

DISKID diskid         The Logical Disk ID of the disk volume being operated on.

FILENAME prefix       The filename prefix being used in the specified operation.

TABLEBUFFER *tablebuffer
                      The buffer used for returning the contents of the Prefix Map Table.


The FPMap primitive returns the arobject ID of the Directory Management Subsystem instance which
manages the directory fragment for the given file. This primitive will usually be invoked by the Client
Arobject Interface, when the first operation (e.g. CreateFile or OpenFile) is requested by a client.
Once the appropriate Directory Management Arobject for a specific file has been determined, any
further requests related to that file will not be addressed to the FPMM Subsystem.
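
A lookup of the kind FPMap performs can be sketched as a search of the Prefix Map Table. The text does not say how a file name is matched when several table entries are prefixes of it (for example "/" and "/usr/tmp"); the sketch assumes that the longest matching prefix wins. The function name, the dictionary representation of the table, and the AID strings are hypothetical.

```python
def fp_map(pmt, filename):
    """Return the FDM arobject ID for `filename`, or None if nothing matches.

    pmt maps prefix -> (diskid, fdm_aid).
    Assumption: the longest matching prefix is selected.
    """
    def matches(prefix):
        if prefix == "/":
            return filename.startswith("/")
        # A prefix must end at a component boundary, i.e. just before a slash.
        return filename.startswith(prefix + "/")

    best = None
    for prefix in pmt:
        if matches(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return pmt[best][1] if best is not None else None
```

Using the table contents of Figure 3-14, a name under /usr/tmp would map to the third directory manager, while a name under /usr/lib would fall back to the entry for "/".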


The  main  purpose  of  the  FPRestart  primitive  is  to  reconstruct  the  Prefix  Map  Table.  This  is  done  by 
determining  which  disks  are  mounted  on  this  node,  and  by  acquiring  information  from  the  PMTs 
residing  at  other  nodes  in  the  system.  The  FPMount  primitive  is  used  when  a  disk  volume  is  mounted. 
If  the  new  disk  has  a  portion  of  the  directory  on  it,  the  Prefix  Map  Table  is  modified  to  reflect  this,  and 
an  FDM  arobject  is  created  to  manage  that  directory.  The  FPUnmount  primitive  is  used  when  a  disk 
volume  is  removed.  Any  entries  corresponding  to  this  disk  in  the  Prefix  Map  Table  are  removed,  and 
the  FDM  arobject  is  informed. 




The  FPAssign  and  FPUnassign  primitives  are  used  for  adding  and  removing  prefixes  to  and  from  the 
entire  file  system  directory.  A  new  prefix  can  be  added  to  the  specified  disk  irrespective  of  whether 
the  disk  already  has  a  part  of  the  directory  on  it  or  not.  When  a  new  prefix  is  added,  modifications  are 
made  to  the  Prefix  Map  Table  and  to  the  disk  itself.  An  FDM  arobject  for  the  disk  also  has  to  be 
created,  if  it  does  not  already  exist.  The  FPUnassign  primitive  removes  a  prefix  from  a  disk  by  making 
modifications  to  the  PMT  and  to  the  disk.  It  also  informs  the  FDM  arobject  if  necessary.  This  primitive 
will  execute  only  if  the  prefix  being  unassigned  does  not  correspond  to  any  existing  files;  otherwise  an 
error  indication  is  returned. 

Three  primitives  are  provided  for  obtaining  and  modifying  information  from  the  Prefix  Map  Table. 
These  primitives  will  be  invoked  primarily  by  FPMM  arobjects  on  other  nodes  of  the  system.  The 
FPInsertTable  primitive  inserts  a  new  entry  into  the  PMT,  which  consists  of  the  directory  prefix  being 
added,  the  ID  of  the  logical  disk  on  which  the  directory  portion  exists,  and  the  ID  of  the  FDM  arobject 
responsible  for  the  management  of  that  directory  portion.  The  FPRemoveTable  primitive  removes  the 
specified prefix entry from the PMT. The FPRequestTable primitive asks for all the contents of the
PMT  to  be  returned  in  the  buffer  provided. 

3.6.3.1 Components of the Prefix Map Management Subsystem

The FPMM subsystem consists of three components: (1) the Prefix Map Table or the PMT, (2) the
Prefix Map Manager process, and (3) the Prefix Map Worker process (refer to Figure 3-17). The PMT
keeps a copy of the entire system wide mapping between prefixes and directory locations. Its entries
are a series of triples consisting of the file name prefix, the ID of the logical disk containing the
directory fragment corresponding to this prefix, and the ID of the FDM arobject which manages this
directory  fragment.  It  is  important  to  ensure  the  consistency  of  the  multiple  copies  of  the  PMT. 

The Prefix Map Manager process is responsible for providing most of the functionality of the FPMM
subsystem. It accepts all requests for FPMM operations, and returns the results. The only purpose of
the Prefix Map Worker process is to wait on behalf of the Manager process when peer level primitives
are invoked by the Manager process. These primitives (FPInsertTable, FPRemoveTable,
FPRequestTable) are used for maintaining system wide consistency between the PMT instances. In
the  absence  of  the  Worker  process,  deadlocks  can  arise  if  multiple  peer  operations  are  in  progress 
concurrently. 




Prefix Map Table columns: File Name Prefix, Logical Disk ID, Directory Manager AID

Figure 3-17: Components of the Prefix Map Management Subsystem


3.6.3.2 Directory Mounting and Reassignment

In order to understand the implementation of the primitives FPMount, FPUnmount, FPAssign, and
FPUnassign,  it  is  important  to  have  some  information  about  the  layout  of  the  logical  disk  volumes. 
There  is  a  special  page  (Page  0)  on  each  disk  volume  which  contains  information  about  the  contents 
and  location  of  several  important  data  structures  stored  on  the  disk.  Hence,  by  examining  Page  0,  it  is 
possible  to  determine  whether  there  are  any  directory  fragments  on  the  disk,  and  if  so,  what  file 
prefixes  they  correspond  to. 

When  a  logical  disk  is  mounted  on  a  node,  the  FPMount  primitive  is  invoked  on  that  node  to  update 
the  PMT  if  necessary.  First,  Page  0  of  the  disk  is  checked  to  see  whether  the  disk  has  a  directory 
fragment  on  it,  and  if  so,  to  determine  the  prefixes  which  are  mapped  on  it.  All  these  prefixes  are  then 
entered  in  the  local  copy  of  the  PMT,  along  with  the  ID  of  the  logical  disk.  An  FDM  arobject  is  created 
and  initialized  to  manage  the  newly  mounted  disk  directory,  and  its  arobject  ID  is  entered  in  the  PMT. 
This completes the updating of the local PMT. All newly added entries are now sent to all other
instances of FPMM by invoking the FPInsertTable primitive on peer FPMM arobjects.

When a disk is unmounted, the FPUnmount primitive is invoked. If there are any entries in the PMT
which correspond to this logical disk, these entries are removed. The FDM arobject for that disk is
cleaned  up  and  destroyed  by  invoking  the  FDUnmount  primitive.  Furthermore,  peer  FPMM  arobjects 
are  informed  by  invoking  the  FPRemoveTable  primitive. 
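The mount and unmount sequences above can be sketched as follows. This is only an illustrative model, not the ArchOS implementation: the class `FPMM`, the method names, the in-memory dictionaries, and the format of the FDM arobject ID are all assumptions made for the example, and the contents of Page 0 are passed in as a plain list of prefixes.

```python
class FPMM:
    """Hypothetical model of one Prefix Map Manager instance per node."""

    def __init__(self, node):
        self.node = node
        self.pmt = {}      # prefix -> (logical disk ID, FDM arobject ID)
        self.peers = []    # other FPMM instances in the system
        self._next_fdm = 0

    def fp_mount(self, disk_id, page0_prefixes):
        # No directory fragment on the disk: nothing to add to the PMT.
        if not page0_prefixes:
            return []
        # Stand-in for creating and initializing an FDM arobject.
        fdm_aid = f"{self.node}:fdm{self._next_fdm}"
        self._next_fdm += 1
        new = {p: (disk_id, fdm_aid) for p in page0_prefixes}
        self.pmt.update(new)                  # update the local PMT first
        for peer in self.peers:               # then inform the peers
            peer.fp_insert_table(new)
        return sorted(new)

    def fp_unmount(self, disk_id):
        gone = [p for p, (d, _) in self.pmt.items() if d == disk_id]
        for p in gone:
            del self.pmt[p]
        for peer in self.peers:
            peer.fp_remove_table(gone)
        return sorted(gone)

    # Peer-level primitives: apply a change made on another node.
    def fp_insert_table(self, entries):
        self.pmt.update(entries)

    def fp_remove_table(self, prefixes):
        for p in prefixes:
            self.pmt.pop(p, None)
```

In this sketch the peer calls are synchronous; in the system described here they are issued through the Worker process so that the Manager does not block on its peers.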

The FPAssign primitive assigns a new prefix to a particular logical disk. First, Page 0 of the disk is
checked  to  see  whether  it  already  has  a  directory  on  it  or  not.  If  there  is  no  directory,  a  directory  root 
page  set  is  created  on  that  disk,  and  a  pointer  to  its  root  page  is  entered  in  Page  0,  along  with  the  new 
prefix  being  added.  An  FDM  arobject  also  has  to  be  created  to  manage  the  directory  on  this  disk.  An 
entry  is  added  to  the  PMT  corresponding  to  the  new  prefix,  logical  disk,  and  FDM  arobject.  The 
addition  of  a  new  prefix  to  a  disk  which  already  has  a  directory  fragment  on  it  is  much  simpler.  It  only 
requires  a  new  entry  on  Page  0  of  the  disk  and  in  the  PMT,  since  the  directory  root  page  set  and  FDM 
arobject are already present. In both cases, once the PMT has been updated, the peer FPMMs are
informed  of  the  new  entry. 

The  FPUnassign  primitive  removes  a  directory  segment  (corresponding  to  a  prefix)  from  a  disk. 
However,  this  operation  is  only  allowed  when  there  are  no  files  in  the  system  which  correspond  to  that 
prefix.  First,  the  prefix  is  removed  from  Page  0  of  the  disk.  If,  as  a  result  of  this  removal,  there  are  no 
more  prefixes  left  on  the  disk,  the  directory  root  page  set  also  has  to  be  freed.  If  the  disk  has  no 
directory  segment  left  on  it,  the  FDM  arobject  for  the  disk  is  destroyed  by  invoking  FDUnmount.  In 
any  event,  the  entry  for  the  prefix  is  removed  from  the  PMT,  and  the  peer  FPMMs  are  informed  of  the 
removal.
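A minimal sketch of the assign and unassign rules described above is given below. The `Disk` class, its field names, and the string used for the directory root pointer are hypothetical; notification of peer FPMMs is omitted, and the "no files with this prefix" check is modeled by passing in the list of existing file names.

```python
class Disk:
    """Hypothetical stand-in for a logical disk and its Page 0."""

    def __init__(self, disk_id):
        self.disk_id = disk_id
        self.page0_prefixes = []   # prefixes recorded on Page 0
        self.dir_root = None       # pointer to the directory root page set
        self.fdm_aid = None        # FDM arobject managing this disk


def fp_assign(disk, pmt, prefix):
    # First prefix on this disk: create the root page set and the FDM arobject.
    if disk.dir_root is None:
        disk.dir_root = "root-page-set"
        disk.fdm_aid = f"fdm:{disk.disk_id}"
    disk.page0_prefixes.append(prefix)
    pmt[prefix] = (disk.disk_id, disk.fdm_aid)


def fp_unassign(disk, pmt, prefix, file_names):
    # Only allowed when no file in the system corresponds to the prefix.
    if any(name.startswith(prefix) for name in file_names):
        return False
    disk.page0_prefixes.remove(prefix)
    if not disk.page0_prefixes:
        # Last prefix removed: free the root page set, destroy the FDM.
        disk.dir_root = None
        disk.fdm_aid = None
    del pmt[prefix]
    return True
```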

3.6.3.3 Restart

The  main  purpose  of  the  FPRestart  primitive  is  to  recreate  the  Prefix  Map  Table.  To  do  this,  the 
Manager  process  first  invokes  the  Worker  process  to  obtain  the  PMT  (with  FPRequestTable)  from  one 
of  the  other  nodes  in  the  system.  In  the  meantime,  the  Manager  itself  checks  the  Mount  Table  of  its 
node (see Section 3.7.4) to determine all the logical disks on the node. For each mounted disk, it
essentially  executes  the  functionality  of  the  FPMount  primitive.  It  checks  Page  0  for  prefixes,  enters 
these  in  the  PMT,  and  creates  an  FDM  arobject  for  each  disk  with  a  directory  fragment.  Once  all 
entries  are  made  to  the  PMT,  all  peer  FPMMs  are  informed  of  the  new  entries. 

There  was  a  danger  of  deadlocks  in  the  restart  sequence,  especially  if  multiple  nodes  happened  to 
come  up  at  the  same  time.  To  avoid  deadlock,  a  separate  Worker  process  is  provided,  which  waits  on 
the peer FPMMs. In addition, a timeout is associated with the FPRequestTable primitive to avoid
indefinite blocking. If several nodes come up at the same time, each node can incorporate entries in
its PMT pertaining to its own logical disks, and then send this information to the other nodes.
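The role of the Worker and of the timeout can be illustrated with a small sketch. It assumes a thread stands in for the Worker process, that `peer_reply` is a hypothetical callable representing the FPRequestTable exchange with a peer, and that the PMT maps a prefix directly to a logical disk ID; none of these details come from the ArchOS sources.

```python
import queue
import threading


def fp_restart(local_disks, peer_reply, timeout=0.2):
    """Rebuild the PMT from local disks plus, if it arrives in time, a peer's copy.

    local_disks: dict mapping logical disk ID -> prefixes found on its Page 0.
    peer_reply:  callable that returns a peer's PMT; it may block indefinitely.
    """
    box = queue.Queue(maxsize=1)

    # The Worker waits on the peer so that the Manager itself never blocks.
    worker = threading.Thread(target=lambda: box.put(peer_reply()), daemon=True)
    worker.start()

    # Meanwhile the Manager scans its own mounted disks.
    pmt = {}
    for disk_id, prefixes in local_disks.items():
        for prefix in prefixes:
            pmt[prefix] = disk_id

    # Bounded wait: if no peer answers, proceed with local entries only.
    try:
        remote = box.get(timeout=timeout)
    except queue.Empty:
        remote = {}

    merged = dict(remote)
    merged.update(pmt)   # local knowledge of local disks takes precedence
    return merged
```

If every node comes up at once and no peer can answer, each call still returns after the timeout with the entries for its own disks, which is the behavior the text relies on to avoid deadlock.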




3.6.4  Directory  Management 

The  main  purpose  of  the  Directory  Management  Subsystem  is  to  maintain  a  part  of  the  file  directory, 
and  thereby  map  filenames  occurring  in  this  directory  fragment  to  their  corresponding  global  page  set 
identifiers.  This  mapping  determines  the  logical  disk  on  which  the  file  resides,  and  the  ID  of  the  page 
set  arobject  which  manages  the  root  page  set  for  the  file.  In  addition  to  the  mapping  information,  the 
directory  also  maintains  static  file  information,  such  as  creation  date,  modification  date,  and  file 
length.  An  instance  of  the  File  Directory  Management  Subsystem  (FDM  Subsystem)  exists  for  each 
disk  with  a  directory  fragment  on  it. 

Operations which require directory mapping information, such as FDOpen, FDClose, FDCreate, and
FDDelete,  are  all  supported  by  the  FDM  subsystem.  Filename  matching  operations,  such  as  finding  all 
files  with  a  matching  prefix,  are  also  directory  oriented  operations,  and  are  performed  by  this 
subsystem.  In  addition  to  these  operations,  primitives  are  provided  for  accessing  and  modifying  the 
directory  entries. 


fa-aid = FDOpen(filename)

fa-aid = FDCreate(filename, type [, node])

action = FDClose(modified, filesize)

val = FDDelete(filename)

val = FDUndeleteAtomic(filename)

val = FDExpungeAtomic(filename)

val = FDSync([filename])

last = FDFind(prefix, after, comp, buffer, bufsize)

val = FDInsert(filename, file-info)
val = FDGet(filename, file-info)
val = FDRemove(filename)

val = FDRestart(diskid)
val = FDUnmount()


AID fa-aid          The ID of the File Arobject corresponding to the file being opened or created.

FDACTION action     Specifies one of three possible actions to be taken: CONTINUE, DEACTIVATE, or
                    KILL.

BOOLEAN val         TRUE if the specified operation is done successfully, otherwise FALSE.

FILENAME last       A number of components or tails (depending on the value of comp) of filenames
                    are found, matching the given prefix. The last component or tail is returned in the
                    last variable. If the end of the list of components or tails has been reached, a
                    special value (NULL) is returned.

FILENAME filename   The name of the file for which the specified operation is to be performed.

FILETYPE type       The type of file to be created: NORMAL, PERMANENT, or ATOMIC.

NODENAME node       The ID of the preferred node on which the file should be created.

BOOLEAN modified    Flag which specifies whether the file has been modified or not.

INT filesize        The size of the file in bytes.

FILENAME prefix     The filename prefix which has to be matched, and corresponding to which all
                    filename components or tails of filenames have to be returned.

FILENAME after      The value returned by the variable last is used here. Hence it is either a
                    component or tail of a filename. Matching of components or tails has to start after
                    this element. If the value of after is NULL, matching starts from the first element.

BOOLEAN comp        If the value is TRUE, only the next components of the filenames following the
                    matching prefix are returned. If the value is FALSE, the entire tails of filenames
                    are returned.

BUFFER *buffer      The address for the data buffer.

INT bufsize         The size of the buffer being provided.

FILEINFO file-info  Returns all the information about a file saved in the directory entry.

DISKID diskid       The Logical Disk ID of the disk volume being operated on.

The  FDOpen  primitive  opens  an  existing  file.  If  this  is  the  first  FDOpen  operation  on  the  file,  the 
Arobject/Process  Management  Subsystem  is  called  upon  to  activate  the  inactive  file  arobject,  and 
return  the  ID  of  the  active  file  arobject.  If  the  file  has  already  been  opened,  the  count  of  opens  is 
incremented,  and  the  ID  of  the  active  file  arobject  is  returned.  If  the  file  does  not  exist,  an  error  is 
returned.  The  FDCreate  primitive  creates  a  file  of  the  specified  type  (NORMAL,  PERMANENT,  or 
ATOMIC)  on  the  preferred  computing  node,  if  possible.  The  Arobject/Process  Management 
Subsystem  is  requested  to  create  a  file  arobject  of  the  appropriate  type,  and  the  ID  of  this  arobject  is 
returned.  A  directory  entry  for  the  newly  created  file  is  added.  Following  creation,  the  file  is  opened. 
If  the  file  already  exists,  an  error  is  returned. 

The FDClose primitive is always invoked by the file arobject, when some client has requested the
FAClose operation. It responds by specifying one of three operations (CONTINUE, DEACTIVATE, or
KILL),  to  be  carried  out  by  the  file  arobject.  It  also  modifies  the  static  file  information  saved  in  the 
directory.  If  the  file  has  been  modified,  the  modification  date  and  time  is  changed  to  the  present  time, 
and  the  file  length  is  updated. 

The  actions  taken  by  the  FDDelete  primitive  depend  on  whether  the  file  is  open  or  not  at  the  time  of 
the request. If the file arobject is inactive, FDDelete deletes the file and removes its directory entry. If
the file is open at the time the FDDelete request is received, a clean file deletion is provided: further
FDOpen operations are not allowed, and once the file has been closed by all current users, it is
deleted. In the case of atomic files, the FDDelete primitive merely flags the file (in the directory) as
deleted,  so  that  it  can  be  recovered  in  case  the  transaction  is  aborted. 

The  FDUndeleteAtomic  and  FDExpungeAtomic  primitives  are  provided  to  allow  atomic  deletion  of 
files.  The  FDDelete  primitive  for  an  atomic  file  sets  a  flag  in  the  directory  entry.  After  the  transaction 
commits, the garbage collector can invoke the FDExpungeAtomic primitive [34], which removes the
directory  entry,  and  expunges  the  file.  If,  on  the  other  hand,  the  transaction  has  to  be  aborted,  the 
FDDelete  can  be  undone  by  the  FDUndeleteAtomic  primitive,  since  the  file  has  not  been  expunged. 
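The flag-based protocol for atomic files can be sketched as follows. The `Directory` class and its method names are hypothetical, and a directory entry is reduced to a single `deleted` flag; the transaction machinery that decides between commit and abort is outside the sketch.

```python
class Directory:
    """Hypothetical directory holding only the DELETED flag per atomic file."""

    def __init__(self):
        self.entries = {}   # filename -> {"deleted": bool}

    def create(self, name):
        self.entries[name] = {"deleted": False}

    def fd_delete(self, name):
        # For an atomic file, deletion only sets a flag in the directory entry.
        entry = self.entries.get(name)
        if entry is None or entry["deleted"]:
            return False
        entry["deleted"] = True
        return True

    def fd_undelete_atomic(self, name):
        # Transaction aborted: the file was never expunged, so clear the flag.
        entry = self.entries.get(name)
        if entry is None or not entry["deleted"]:
            return False
        entry["deleted"] = False
        return True

    def fd_expunge_atomic(self, name):
        # Transaction committed: remove the entry for good.
        entry = self.entries.get(name)
        if entry is None or not entry["deleted"]:
            return False
        del self.entries[name]
        return True
```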

The  FDSync  operation  ensures  that  any  buffered  directory  information  pertaining  to  the  specified 
file  is  synchronized  with  the  version  saved  on  disk.  If  a  file  name  is  not  specified,  the  contents  of  the 
entire  directory  buffer  are  synchronized. 

The  purpose  of  the  FDFind  primitive  is  to  allow  some  of  the  convenience  of  hierarchical  directories. 
In  a  file  system  with  a  hierarchical  directory,  it  is  usually  possible  to  search  for  all  files  and 
subdirectories  which  exist  in  a  particular  directory.  The  FDFind  primitive  implements  the  same  notion 
in  our  "flat"  file  name  space.  Given  a  file  prefix  (parallel  to  the  full  pathname  of  a  directory)  it  can  find 
all the components (filenames or subdirectory names contained in the directory). It is also possible to
find the full tails of all matching file names (parallel to doing a recursive directory search). The
boolean variable comp determines which type of search is undertaken: for matching components or
tails.  Since  the  buffer  for  returning  the  matching  elements  may  not  be  able  to  hold  the  entire  list  of 
matches  found,  the  last  matching  element  is  returned  as  the  result  of  the  FDFind  primitive.  The 
search  can  begin  after  this  element  when  the  next  FDFind  operation  is  invoked. 
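The following sketch illustrates the FDFind behavior described above over a flat list of names. Several details are assumptions made for the example: components are taken to be separated by "/", the buffer size is counted in elements rather than bytes, and the function returns the buffer contents together with the `last` value instead of filling a caller-supplied buffer.

```python
def fd_find(names, prefix, after, comp, bufsize):
    """Return (buffer, last) for names matching prefix in a flat name space."""
    if bufsize < 1:
        raise ValueError("buffer must hold at least one element")
    elems = set()
    for name in names:
        if name.startswith(prefix):
            tail = name[len(prefix):]
            # comp=True: only the next component; comp=False: the whole tail.
            elem = tail.split("/", 1)[0] if comp else tail
            if elem:
                elems.add(elem)
    # Resume strictly after the element returned by the previous call.
    ordered = sorted(e for e in elems if after is None or e > after)
    buffer = ordered[:bufsize]
    # None plays the role of NULL: the end of the list has been reached.
    last = buffer[-1] if len(ordered) > bufsize else None
    return buffer, last
```

A caller would loop, passing the returned `last` as the next `after`, until `last` is `None`.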

[34] The FDExpungeAtomic operation can alternatively be invoked by the transaction manager, or even the client.

Three primitives are provided for manipulating the directory entries. The FDInsert primitive allows a
new entry to be inserted in the directory. The FDGet primitive allows the directory entry for a
particular file to be read, and the FDRemove primitive removes the entry for a particular file. These
primitives can be used by the Client Arobject Interface to provide several useful functions to the
client. For example, the RenameFile primitive is implemented by first obtaining file information using
FDGet, then removing the old directory entry with FDRemove, and finally adding a new directory entry
with FDInsert to correspond to the new file name.
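The RenameFile composition can be sketched in a few lines. Here a plain dictionary stands in for the directory, and the check that the new name is not already in use is an assumption added for the example; the text does not say how such a collision is handled.

```python
def rename_file(directory, old_name, new_name):
    """Rename by composing the three directory-entry primitives."""
    info = directory.get(old_name)          # FDGet: read the old entry
    if info is None or new_name in directory:
        return False
    del directory[old_name]                 # FDRemove: drop the old entry
    directory[new_name] = info              # FDInsert: add it under the new name
    return True
```

Note that the three steps are separate operations, so this sketch is not atomic with respect to a failure between FDRemove and FDInsert.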

The FDRestart primitive is primarily an initialization operation, invoked when the FDM arobject is
created. A cleanup operation on the logical disk associated with this FDM arobject instance is initiated.
The directory is checked for all normal files, and the Page Set Subsystem is asked to free all page sets
corresponding to these files. Their directory entries are also removed. The FDUnmount operation is
invoked when the logical disk associated with this FDM instance has to be unmounted cleanly. Once
an FDUnmount has been received, FDM does not accept any new operations except FDClose. When
all  currently  open  files  have  been  closed,  it  sends  a  reply  to  the  requestor  of  FDUnmount,  and 
destroys  itself. 

3.6.4.1 Components of the Directory Management Subsystem

The FDM Subsystem consists of three main components: (1) the Directory Manager process, (2) the
buffer for the directory B-tree, and (3) the Open Files Table. In addition to these components, a few
other  items  of  information  are  also  maintained,  such  as  the  logical  disk  ID,  and  some  flags.  The  three 
components  are  shown  in  Figure  3-18,  and  described  briefly  in  this  section. 

The  Directory  Manager  process  is  responsible  for  providing  all  of  the  functionality  of  the  FDM 
subsystem.  It  accepts  all  requests  for  FDM  operations,  and  returns  the  results.  It  manages  all  the 
FDM  data  structures  as  well.  In  order  to  maintain  the  directory  buffer,  it  invokes  operations  on  the 
Page  Set  Subsystem. 

The  file  system  directory  uses  a  B-tree  structure,  and  is  saved  on  disk  as  a  number  of  different  page 
sets (refer to Section 3.6.4.2 for details). Some of the pages of the directory are buffered in main
memory by the FDM subsystem, in a data structure known as the Directory Buffer. The FDM Manager
Process interacts with the Page Set Subsystem in obtaining directory pages from disk, and in writing
back  any  dirty  buffered  pages. 

The Open Files Table (OFT) keeps track of the files which are open. It maintains the names of these files,
their arobject IDs, and the number of outstanding opens on each file. In addition, it maintains some
flags, such as for open files which are deleted (or expunged). Whenever an FDOpen request is
received, the OFT is first checked to see whether the file has already been opened or not. When a file
is opened for the first time, an entry is made in the OFT, but for subsequent opens, the count of
outstanding opens is incremented. For each FDClose operation, the count of outstanding opens for
that file is decremented. If the count becomes zero, the entry is removed from the OFT, and the file
arobject is told to deactivate itself. If an FDDelete operation is received on an open file, a DELETE
flag is set in the OFT. Further FDOpen operations are not allowed, but FDClose operations are
accepted, to achieve a clean deletion. Once a DELETE flag has been set, the file arobject is told to
kill itself at the time of the last close operation. [35]

[Figure: the Directory Manager accepts FD operations and replies with results, issues Page Set operation requests and receives the replies, and maintains the Directory B-Tree Buffer and the Open Files Table (FA AID, count of opens, flags, file name strings)]

Figure 3-18: Components of the Directory Management Subsystem

[35] If an atomic file is deleted in this way, its file arobject is deactivated, and not killed. A flag is set in the directory entry for
the file, indicating that it has been deleted, and further operations on that file, other than FDUndeleteAtomic or
FDExpungeAtomic, are not allowed.




3.6.4.2 The Directory B-Tree

The  file  system  directory  maps  file  names  to  global  page  set  IDs.  It  also  saves  some  static 
information  for  each  file:  the  date  and  time  of  creation  and  last  modification,  the  file  type,  the  file 
length,  and  some  flags  (e.g.  the  DELETED  flag,  to  mark  files  which  have  been  deleted  but  not 
expunged).  Logically,  the  file  system  directory  fragment  on  a  particular  disk  is  an  independent  B-tree. 
The  entries  in  this  B-tree  are  ordered  by  the  filename  (in  alphabetical  order).  Each  node  of  the 
directory B-tree is implemented as a page set. Thus, the directory is implemented as a set of page
sets, which are maintained on disk by the Page Set Subsystem.

[Figure: layout of a node page set, divided into Page 0, Pages 1 and 2, and Pages 3 and up]

Figure 3-19: Structure of a Directory B-tree Node


The structure of a node in the directory B-tree is somewhat complex (refer to Figure 3-19). Each node
of the directory B-tree contains a large number of entries (200 to 250). Logically, each directory entry
is a triple consisting of the filename, the global page set ID, and a pointer to the file information block.
The  actual  implementation  is  complicated  by  the  fact  that  file  names  have  variable  sizes.  Hence, 
instead  of  keeping  the  file  name  as  the  first  element  of  the  triple,  a  pointer  to  the  file  name  is  kept. 
This  file  name  pointer  is  a  byte  offset  from  the  start  of  the  area  for  saving  the  full  file  names. 




Each  node  of  the  directory  B-tree  is  a  page  set.  The  first  page  of  this  page  set  is  the  header  page, 
which  contains  a  series  of  triples.  The  first  element  of  the  triple  is  a  pointer  to  the  full  file  name,  the 
second  is  the  global  page  set  identifier,  and  the  third  is  a  pointer  to  the  information  block  (fixed  size) 
for  this  file.  If  the  node  is  internal  (not  a  leaf  node  in  the  tree),  the  page  set  ID  points  to  the  node  of  the 
B-tree  to  be  searched  next.  However,  if  the  node  is  external  (a  leaf  node),  the  page  set  ID  refers  to  the 
location  of  the  file  (ID  of  the  root  page  set  for  the  file).  The  second  and  third  pages  of  the  node  page 
set  are  reserved  for  the  file  information  blocks.  These  pages  will  be  unused  in  the  case  of  internal 
nodes.  The  full  file  names  are  stored  starting  at  the  fourth  page.  Hence,  the  pointer  to  the  file  name 
holds a byte offset from the start of the fourth page. [36]
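The node layout just described, with fixed-size triples in the header page and variable-length names stored separately, can be sketched as below. The field widths (three 32-bit unsigned integers), the NUL terminator on names, the use of an index as the information-block pointer, and the binary search are all assumptions for illustration; the text specifies only that the name pointer is a byte offset into the names area.

```python
import struct

# One header triple: (name offset, page set ID, info-block pointer).
# Field widths are assumed for the example.
TRIPLE = struct.Struct("<III")


def build_node(entries):
    """entries: iterable of (filename, page_set_id); returns (header, names)."""
    header = bytearray()
    names = bytearray()
    for index, (name, psid) in enumerate(sorted(entries)):
        # The name pointer is a byte offset from the start of the names area.
        header += TRIPLE.pack(len(names), psid, index)
        names += name.encode("ascii") + b"\0"
    return bytes(header), bytes(names)


def lookup(header, names, filename):
    """Binary search over the fixed-size triples, following name offsets."""
    target = filename.encode("ascii")
    lo, hi = 0, len(header) // TRIPLE.size
    while lo < hi:
        mid = (lo + hi) // 2
        offset, psid, _info = TRIPLE.unpack_from(header, mid * TRIPLE.size)
        key = names[offset:names.index(b"\0", offset)]
        if key == target:
            return psid
        if key < target:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Keeping the triples at a fixed size is what makes the header page searchable by index even though the names themselves vary in length.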

3.6.4.3 FDM File Manipulation Operations

In this section, we will briefly discuss the key interactions of the FDM file manipulation operations:
FDOpen,  FDClose,  FDCreate,  FDDelete,  FDUndeleteAtomic,  and  FDExpungeAtomic.  Each  of  these 
operations  (except  FDClose)  is  invoked  by  some  Client  Arobject  Interface,  and  the  result  is  returned 
there. 

When  the  operation  FDOpen  is  requested,  the  Open  Files  Table  is  first  checked  to  see  whether  the 
file  has  already  been  opened.  If  this  is  indeed  the  case,  the  arobject  ID  for  the  file  can  be  determined 
from  the  OFT.  No  further  action  is  required,  except  the  updating  of  the  count  of  opens  for  the  file. 
However, if this is the first open request, the inactive file arobject has to be activated. First, the
directory B-tree is searched (by bringing appropriate directory pages into the Directory Buffer) to find
the global page set ID (GPSID) for the root page set of the file in question. Then the Arobject/Process
Management  Subsystem  is  called  to  activate  the  arobject  for  the  given  page  set,  and  return  the  file 
arobject  ID.  An  entry  for  this  file  arobject  is  now  made  in  the  OFT.  Also,  the  newly  created  file 
arobject  is  initialized  by  invoking  the  FARestart  primitive.  This  completes  the  FDOpen  sequence,  and 
the  new  file  arobject  ID  is  returned  to  the  FDM  subsystem. 

The FDCreate sequence is similar in flavor to the FDOpen sequence. First, the directory B-tree is
checked to be sure that the file does not already exist. Next, the Arobject/Process Management
Subsystem (A/PM) is invoked to create a new file arobject of the specified type. The A/PM also decides
where the new file is to be placed. [37] The GPSID and the arobject ID for the newly created file arobject
are returned by A/PM. An entry for the file is created in the directory, and in the OFT. Finally, the
new file arobject is "restarted" (with FARestart), and its ID is returned to the FDM subsystem.

[36] For efficiency reasons, the file name pointer may actually consist of two parts: a prefix pointer, and a last component
pointer. Since file names are typically long, and many of the file names for a given directory node differ only in the values of
their last components, this scheme of storing names should significantly reduce directory storage requirements.

[37] The file placement algorithm is subsumed by the arobject and process placement algorithms, and is implemented by the
Arobject/Process Management Subsystem. If a preference for placement is specified, this is taken into account in making the
final decision.

The  FDClose  primitive  is  always  invoked  by  the  file  arobject,  when  a  client  has  requested  an 
FAClose  operation.  The  FDM  subsystem  responds  to  this  operation  by  specifying  one  of  three  actions 
for the file arobject: CONTINUE, DEACTIVATE, or KILL. When FDClose is requested, first the count
of outstanding open requests in the OFT is decremented. If some outstanding opens still remain, the
CONTINUE command is given to the file arobject. If, however, there are no outstanding opens, the
entry for this file in the OFT is deleted. If the file has been modified (modified is TRUE), the
information  blocks  for  the  file  are  updated.  Finally,  the  file  arobject  is  asked  to  deactivate  itself.  The 
last  FDClose  operation  on  a  file  is  somewhat  different  if  the  file  has  already  been  flagged  for  deletion 
by  some  other  client.  The  directory  entry  for  the  file  is  removed,  and  the  file  arobject  is  asked  to  kill 
itself  (which  automatically  removes  the  associated  page  sets). 

The  implementation  of  the  FDDelete  primitive  is  quite  straightforward.  If  the  file  arobject  is  inactive 
when  the  primitive  is  invoked,  the  Arobject/Process  Management  Subsystem  is  called  to  destroy  the 
inactive  file  arobject  (which  removes  the  relevant  page  sets),  and  the  directory  entry  for  the  file  is 
removed.  If  the  file  is  active  at  the  time  FDDelete  is  called,  the  delete  flag  is  set  for  the  file  in  the  OFT. 
At  the  time  of  the  last  FDClose  operation,  the  file  is  deleted  by  destroying  the  active  file  arobject. 
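The open-count and DELETE-flag bookkeeping described in this section can be summarized in a short sketch. The class and method names, the row layout, and the `activate` callback (standing in for the call to the Arobject/Process Management Subsystem) are hypothetical; directory updates and atomic files are left out.

```python
class OpenFilesTable:
    """Hypothetical model of the OFT: one row per open file."""

    def __init__(self):
        self.rows = {}   # filename -> {"aid": ..., "opens": int, "delete": bool}

    def fd_open(self, name, activate):
        row = self.rows.get(name)
        if row is None:
            # First open: activate the file arobject and create the row.
            row = self.rows[name] = {"aid": activate(name), "opens": 0, "delete": False}
        elif row["delete"]:
            return None              # flagged for deletion: no further opens
        row["opens"] += 1
        return row["aid"]

    def fd_close(self, name):
        row = self.rows[name]
        row["opens"] -= 1
        if row["opens"] > 0:
            return "CONTINUE"
        del self.rows[name]          # last close removes the row
        return "KILL" if row["delete"] else "DEACTIVATE"

    def fd_delete(self, name):
        row = self.rows.get(name)
        if row is None:
            return "DESTROYED"       # inactive file: destroyed immediately
        row["delete"] = True         # open file: deletion deferred to last close
        return "PENDING"
```

The three return values of `fd_close` correspond to the three actions the FDM subsystem can specify to the file arobject.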

The  handling  of  the  primitives  (described  above)  for  atomic  files  may  be  somewhat  different. 
Creating,  deleting,  opening,  and  closing  files  atomically  is  a  part  of  the  bigger  and  more  general 
problem  of  creating,  deleting,  activating,  and  deactivating  arobjects  atomically.  Hence,  these  issues 
have  to  be  addressed  in  a  unified  manner.  At  present,  some  hooks  have  been  provided,  which  help  in 
treating  these  primitives  (especially  FDDelete)  as  compound  transactions  with  compensation  actions. 
Thus, when an atomic file is to be deleted, the directory entry is merely flagged as deleted, but the arobject
is not destroyed. Once the highest level transaction has committed, the garbage collector can
expunge all "deleted" files, by calling FDExpungeAtomic. If the transaction has to abort, the deleted
file  can  be  reclaimed  by  calling  FDUndeleteAtomic. 

3.6.5  File  Management  Scenarios 

In  this  section,  we  will  show  how  the  four  subsystems  within  the  File  Subsystem  interact  with  one 
another,  and  with  other  subsystems  in  the  kernel,  to  provide  the  specified  functionality.  We  will 
examine  the  operations  most  frequently  encountered  in  the  lifetime  of  a  file,  and  thereby  explain  the 
interactions  within  the  File  Subsystem. 


[Figure: interaction diagram with numbered steps; only the label "(6) Create" is legible in this copy]

Figure 3-20: Interactions in the Implementation of the CreateFile Primitive


The CreateFile primitive is invoked by the client when a new file has to be created. In response to
this request, the Client Arobject Interface has to first determine the ID of the Directory Management
arobject which will manage the directory fragment corresponding to the new file. The Client Arobject
Interface first invokes the FPMap primitive (refer to Figure 3-20). The FPMap operation is accepted by
the FPMM Manager process, which checks the prefix of the filename in the Prefix Map Table, to
determine the ID of the corresponding FDM arobject (let this value be fdm-aid). The result of the
FPMap operation (fdm-aid) is returned to the Client Arobject Interface. The CA Interface now invokes
the FDCreate operation on the correct instance of the Directory Management Arobject (fdm-aid). The
FDM Manager reads relevant pages of the directory into the buffer, and ensures that the file does not
already exist. It then calls the Arobject/Process Management Subsystem to create a file arobject
(using the CreateArobject primitive). The Arobject/Process Management Subsystem returns the ID of
the newly created file arobject. Next, the FetchArobjectStatus primitive is used to determine the
global page set ID for the root page set corresponding to this arobject. The Directory Manager then
makes an entry in the directory for the new file, showing the mapping of its filename and its GPSID. It
writes the creation date in the file information block. It also invokes the FARestart operation on the
newly created file arobject to initialize it, and provide some information such as the file size. The FDM
Manager makes an entry for the file in the Open Files Table, and returns the ID of the file arobject to
the Client Arobject Interface. The CA Interface calls the File Arobject with an FAOpen request. The
open operation is carried out, and the current filesize is returned. The CA Interface then returns to
the client with the ID of the file. [38]

The  interactions  of  the  OpenFile  primitive  are  quite  similar  to  those  of  the  CreateFile  primitive.  First, 
the  FPMap  primitive  is  invoked  by  the  Client  Arobject  Interface.  Then,  the  FDOpen  operation  is 
invoked.  If  the  file  has  not  already  been  opened  by  some  other  client,  the  FDM  subsystem  reads  the 
directory into the buffer to determine the global page set ID for the file. It then invokes the
Arobject/Process  Management  Subsystem  to  activate  the  inactive  file  arobject  corresponding  to  the 
GPSID.  The  active  file  arobject  is  initialized  by  FARestart,  and  an  entry  is  made  in  the  Open  Files 
Table.  The  file  arobject  ID  is  returned  to  the  CA  Interface  subsystem.  Finally,  FAOpen  is  called,  and 
the  interactions  of  OpenFile  are  complete. 

The  interactions  in  ReadFile  and  WriteFile  are  quite  straightforward.  The  ReadFile  and  WriteFile 
operations  given  by  the  client  are  buffered  by  the  CA  Interface.  The  FARead  and  FAWrite  operations 
are  invoked  by  the  CA  Interface  as  necessary.  This  reduces  the  number  of  interactions  with  the  file 
arobject. 
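The effect of buffering in the CA Interface can be illustrated for the write path. This is a sketch under stated assumptions: the class name, the `fa_write` callback standing in for the FAWrite operation, and the fixed buffer capacity are all hypothetical, and the read path is not shown.

```python
class BufferedFile:
    """Hypothetical client-side buffer in front of a file arobject."""

    def __init__(self, fa_write, capacity=2048):
        self.fa_write = fa_write      # stand-in for invoking FAWrite
        self.capacity = capacity
        self.buf = bytearray()

    def write_file(self, data):
        # Accumulate client writes; call the file arobject only for full buffers.
        self.buf += data
        while len(self.buf) >= self.capacity:
            self.fa_write(bytes(self.buf[:self.capacity]))
            del self.buf[:self.capacity]

    def flush(self):
        # Done before FAClose so that no buffered data is lost.
        if self.buf:
            self.fa_write(bytes(self.buf))
            self.buf.clear()
```

Five small client writes against a capacity of eight bytes, for instance, result in only two calls on the file arobject: one for the first full buffer and one for the remainder at flush time.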

The CloseFile operation given by the client is translated to FAClose by the CA Interface (refer to Figure
3-21), after first flushing the CA Interface buffers. The file arobject removes the client from the Open
Client List, and invokes the FDClose operation. The information block for the file is updated, and the
file is closed as explained in Section 3.6.4. The file arobject is told to CONTINUE, DEACTIVATE, or
KILL itself, depending on other outstanding open requests, and whether the file has been deleted by
some  other  client.  The  file  arobject  replies  to  the  FAClose  operation,  and  then  executes  the  specified 
action.  If  it  has  to  deactivate  or  kill  itself,  it  calls  on  the  Arobject/Process  Management  Subsystem  to 
perform  the  necessary  operation. 

The interactions for DeleteFile are quite simple. The DeleteFile operation is translated to FDDelete
by the CA Interface, after first using the FPMap operation if necessary. If the file is not open, the FDM
subsystem calls the Arobject/Process Management Subsystem to destroy the inactive file arobject, and
then removes its directory entry. If the file is open at the time the operation is invoked, a DELETE flag
is set in the Open Files Table, and the deletion is completed at the time of the last FDClose operation.

[38] The file ID returned to the client arobject is different from the ID of the corresponding file arobject. The CA Interface
maintains a mapping between the longer file arobject ID, and the shorter file ID which it generates.


3.7  Page  Set  Subsystem 

The  main  purpose  of  the  Page  Set  Subsystem  is  to  provide  an  abstract  interface  to  the  physical 
secondary memory storage devices (disks). The abstraction provided is that of a set of logical disks,
each of which holds some number of page sets. A logical disk is a contiguous region (partition) of a
physical disk, consisting of N logical pages, numbered from 0 to N-1. [39] Each physical disk can contain
at  most  one  ArchOS  logical  disk.  However,  the  logical  disk  can  be  located  anywhere  on  the  physical 
disk,  and  can  possibly  cover  the  entire  disk.  Each  logical  disk  has  a  unique  identifier  (logical  disk  ID) 
permanently  associated  with  it,  i.e.  it  will  always  have  the  same  logical  disk  ID,  regardless  of  which 
node  or  disk  drive  the  disk  is  mounted  on. 

A  page  set  is  (conceptually)  an  infinite  set  of  pages,  numbered  0,  1,2 .  Any  pages  of  a  page  set 


39

Currently a page is defined to be 2K bytes in size, but that is a parameter which can be changed and tuned to obtain
"optimum" performance.




which have not been explicitly written, are defined as containing all zeros. Pages can be read or
written in any order within a page set, i.e. the pages of a page set are randomly accessible, and
sparse page sets are permitted. All of the pages comprising a single page set will be located on the
same logical disk. This helps reduce the number of page sets which would be affected by any single
disk or node failure. Every page set is given a page set ID, which is unique within the logical disk on
which  it  resides.  A  globally  unique  page  set  identifier  (GPSID)  can  then  be  constructed  by  combining 
the  logical  disk  ID  with  the  page  set  ID. 
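
As a minimal sketch of the identifier scheme described above, a GPSID can be modeled as a pair. The class and field names below are illustrative assumptions, not ArchOS definitions, and the report does not specify how the two IDs are physically combined.

```python
from typing import NamedTuple


class GPSID(NamedTuple):
    """Globally unique page set identifier (hypothetical field names)."""
    logical_disk_id: int  # permanently associated with the logical disk
    page_set_id: int      # unique only within that logical disk


def make_gpsid(logical_disk_id: int, page_set_id: int) -> GPSID:
    # Pairing the two IDs disambiguates equal page set IDs on different disks.
    return GPSID(logical_disk_id, page_set_id)


a = make_gpsid(logical_disk_id=7, page_set_id=42)
b = make_gpsid(logical_disk_id=8, page_set_id=42)
```

Here `a` and `b` share a page set ID but remain distinct identifiers, because the logical disk ID differs.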

ArchOS  supports  two  different  classes  of  page  sets:  standard  page  sets,  and  atomic  page  sets. 
Standard  page  sets  themselves  come  in  three  types:  temporary,  permanent,  and  dual.  Temporary 
page  sets  are  intended  to  provide  short  term  storage  for  data  on  secondary  memory.  They  are 
destroyed,  and  their  pages  automatically  reclaimed  by  the  system,  following  a  crash.  Temporary  page 
sets are primarily used as the paging areas for the "volatile" segments of process address spaces,
such as the User Stack, User Heap, and Shared Normal Segments (see Section 3.2.2.5). In particular,
normal  file  arobjects  (see  Section  3.6)  store  all  of  their  data  in  volatile  segments,  and  hence  in 
temporary  page  sets. 

Permanent  page  sets  are  primarily  used  to  store  permanent  abstract  data  type  instances,  such  as 
permanent  files.  Permanent  page  sets  survive  system  crashes,  but  their  consistency  following  a  crash 
is  not  guaranteed.  If  a  permanent  page  set  was  being  actively  modified  at  the  time  of  a  crash,  some  of 
the  modifications  may  be  permanently  recorded  while  others  are  lost.  Furthermore,  if  a  page  was 
being  written  precisely  at  the  time  of  the  crash,  that  page  could  be  written  incorrectly,  resulting  in  the 
loss  of  some  data.  Note  that  permanent  page  sets  closely  resemble  the  notion  of  “files",  as 
commonly  found  in  more  conventional  operating  systems. 

Dual  page  sets  are  like  permanent  page  sets,  except  that  each  page  is  written  atomically.  If  a  crash 
occurs  while  writing  a  page  in  a  dual  page  set,  either  the  new  contents  of  the  page  will  be 
permanently  and  correctly  recorded,  or  the  original  contents  will  be  recovered  (as  if  the  write 
operation  had  never  occurred).  The  atomic  writing  of  an  individual  page  is  accomplished  by  carefully 
writing  two  copies  of  the  page  (hence  the  name  “dual"  page  set).  For  more  details,  see  the 
explanation  of  “stable  storage"  in  [Sturgis  80].  Dual  page  sets  are  used  for  storing  key  data 
structures,  such  as  file  system  directories,  for  which  it  is  important  to  minimize  the  amount  of  damage 
in  the  event  of  a  crash. 
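
The two-copy write can be sketched as follows. This is an illustration in the spirit of stable storage, not the ArchOS implementation: the checksum format, the `DualPage` class, and the `crash_after` and `torn` parameters (which only simulate a crash) are all assumptions, and the sketch assumes that at most one copy is damaged at any time.

```python
import zlib

PAGE_SIZE = 2048  # the report's current page size


def _seal(data: bytes) -> bytes:
    # Append a CRC so a torn (partially written) copy can be detected.
    return data + zlib.crc32(data).to_bytes(4, "big")


def _unseal(raw):
    # Return the page contents, or None if the copy is damaged.
    if raw is None or len(raw) != PAGE_SIZE + 4:
        return None
    data, crc = raw[:-4], int.from_bytes(raw[-4:], "big")
    return data if zlib.crc32(data) == crc else None


class DualPage:
    def __init__(self, initial: bytes = bytes(PAGE_SIZE)):
        self.copies = [_seal(initial), _seal(initial)]

    def write(self, data: bytes, crash_after: int = 2, torn: bool = False):
        # Copy 0 is written completely before copy 1 is started.
        sealed = _seal(data)
        for i in range(2):
            if i == crash_after:  # simulated crash point
                if torn:
                    self.copies[i] = sealed[: PAGE_SIZE // 2]  # partial write
                return
            self.copies[i] = sealed

    def recover(self) -> bytes:
        # Prefer copy 0 (new contents); fall back to copy 1 (old contents).
        first, second = _unseal(self.copies[0]), _unseal(self.copies[1])
        good = first if first is not None else second
        self.copies = [_seal(good), _seal(good)]
        return good
```

A crash during the first copy leaves the second copy holding the original contents, while a crash during the second copy leaves the first copy holding the new contents, so recovery always yields one or the other.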


40

Of course, such unwritten pages do not occupy any space on the logical disk.




The  final  type  of  page  set  supported  by  ArchOS  is  the  atomic  page  set.  Atomic  page  sets  actually 
represent  a  second  major  class  of  page  sets,  differing  (somewhat)  from  standard  page  sets,  both  in 
terms  of  their  implementation  and  their  user  interface.  Atomic  page  sets  permit  the  grouping  of 
operations  into  atomic  transactions,  (either  all  of  the  operations  will  be  completed,  or  none  of  them 
will  be).  The  operations  comprising  a  single  transaction  can  involve  multiple  atomic  page  sets,  stored 
on  any  number  of  different  logical  disks,  and  spanning  any  number  of  system  nodes.  Furthermore, 
operations  are  not  restricted  to  being  page-oriented.  Instead,  arbitrary  sequences  of  bytes  can  be 
read or written. Thus, two separate transactions can modify different parts of a single page, without
interference  or  unexpected  inconsistencies  arising. 

Atomic  page  sets  also  support  the  notion  of  nested  transactions.  Both  elementary  and  compound 
transactions  can  be  arbitrarily  nested,  as  explained  in  Section  3.5.  The  commit  or  abort  of  a 
(sub)transaction  will  be  properly  reflected  in  the  contents  of  the  affected  atomic  page  sets.  To 
support  this,  the  atomic  page  sets  contain  the  intermediate  states  of  all  of  the  incomplete,  nested 
transactions,  in  addition  to  the  “current"  page  set  contents.  Atomic  page  sets  provide  complete 
failure atomicity for "soft and clean" failures, as defined in [Bernstein 83]. However, the consistency
of  atomic  page  sets  in  the  face  of  “concurrent"  updates  from  separate  transactions  is  not 
automatically  guaranteed.  The  lock  management  primitives  of  the  Transaction  Subsystem  must  be 
used  to  ensure  such  consistency  (see  Section  3.5).  Atomic  page  sets  are  used  to  store  atomic 
abstract  data  type  instances,  such  as  atomic  files. 

The  Page  Set  Subsystem  is  implemented  using  four  different  types  of  kernel  arobjects.  Each  node 
of  the  distributed  computer  system  will  contain  one  or  more  instances  of  these  four  arobjects,  as 
illustrated  in  Figure  3-22.  For  each  physical  disk  that  is  mounted  on  a  node,  a  corresponding  instance 
of  the  Logical  Disk  Subsystem,  the  Standard  Page  Set  Subsystem,  and  the  Atomic  Page  Set 
Subsystem  will  be  created.  In  addition,  each  node  will  contain  a  single  instance  of  the  Page  Set 
Restart/Reconfiguration  Subsystem.  Figure  3-22  shows  the  “uses"  relationships  among  these 
component  subsystems. 

The  Logical  Disk  Subsystem  builds  the  logical  disk  abstraction  (discussed  above)  on  top  of  the 
available  physical  disk  hardware.  It  manages  the  allocation  of  pages  on  the  logical  disk,  and  supports 
multi-page  read/write  operations  to/from  either  local  or  remote  primary  memory  (buffer)  areas.  The 
Standard Page Set Subsystem provides support for the three types of standard page sets: temporary,
permanent, and dual. It constructs these page sets using the facilities of the corresponding Logical
Disk Subsystem. Similarly, the Atomic Page Set Subsystem manages all of the atomic page sets that
are  stored  on  the  corresponding  logical  disk.  Finally,  the  Page  Set  Restart/Reconfiguration 


Figure 3-22: Major Components of the Page Set Subsystem (diagram labels: Physical Disk [I], Physical Disk [J])


Subsystem  (one  per  node)  is  responsible  for  constructing  the  three  "per-disk"  subsystems,  for  each 
disk  that  is  mounted  on  its  node.  It  allows  disks  to  be  "dynamically”  added  and  removed  from  the 
system,  without  requiring  the  associated  node  to  be  completely  restarted.  In  conjunction  with  this, 
the  Page  Set  Restart/Reconfiguration  Subsystem  cooperates  with  its  peers  on  other  nodes,  to 
maintain a global list (replicated at each node) of all the disks mounted anywhere within the
distributed  computer  system. 

3.7.1  Standard  Page  Set  Subsystem 

An  instance  of  the  Standard  Page  Set  Subsystem  kernel  arobject  is  associated  with  each  logical 
disk  in  the  system.  The  Standard  Page  Set  Subsystem  supports  three  types  of  page  sets:  temporary, 
permanent,  and  dual.  The  same  set  of  operations  are  provided  for  each  of  these  types  of  page  sets, 
but  the  semantics  of  some  of  the  operations  vary  slightly,  depending  on  the  page  set  type.  For 
example, the writing of pages in a dual page set is handled differently than in the case of a temporary
or  permanent  page  set.  The  Standard  Page  Set  Subsystem  supports  multi-page  read/write  access  to 




the  contents  of  page  sets,  allowing  greater  efficiency  than  would  be  possible  with  a  page-at-a-time 
interface. The following is the complete list of primitives provided by the Standard Page Set
Subsystem:


psid = PSCreate(type)
val = PSDestroy(psid)
npr = PSRead(psid, pnum, npages, buffer)
val = PSIsZero(psid, pnum, npages)
npw = PSWrite(psid, pnum, npages, buffer)
npw = PSZero(psid, pnum, npages)
val = PSMove(psid, pnum, npages, to-pnum)
val = PSSync([psid])
val = PSStatus(psid, statusbuffer)
val = PSRestart(disk-aid [, fast])

PSID psid  The identifier for the page set (unique within the logical disk). It includes an
indication of the page set type: TEMPORARY, PERMANENT, or DUAL.

BOOLEAN val  TRUE if the specified operation is completed successfully; otherwise FALSE.

INT npr, npw  The actual number of pages read or written.

PSTYPE type  The type of page set to be created: TEMPORARY, PERMANENT, or DUAL.

INT pnum  The starting page number, within a page set, for the specified operation.

INT npages  The number of pages involved in the specified operation.

BUFFER *buffer  The buffer address in the kernel address space of either the local or remote node.

INT to-pnum  The destination page number, to which the specified pages are to be moved.

PSSTATUSBUFFER *statusbuffer
The buffer address for returning status information (can be either a local or
remote kernel address).

AID disk-aid  The arobject ID of the Logical Disk Subsystem, associated with the disk to be
used by this instance of the Standard Page Set Subsystem.

BOOLEAN fast  TRUE if this is to be a fast restart, avoiding all of the disk checking and garbage
collection; otherwise FALSE. The default is FALSE.




The  PSCreate  primitive  is  used  to  create  a  new  page  set,  of  the  specified  type,  on  the  disk 
associated  with  this  instance  of  the  Standard  Page  Set  Subsystem.  The  identifier  for  the  new  page  set 
(psid), which is unique within this logical disk, is returned. Initially, no pages are actually allocated to
the page set, i.e. reading any page will return all zeros. The PSDestroy primitive frees all pages
belonging to the specified page set (psid), and removes all record of that page set from the disk.

PSRead reads multiple (npages) pages, beginning with pnum, from page set psid, into the given
primary memory buffer. If the buffer is located on a remote node, the required copying of data across
the network will be handled automatically by the Logical Disk Subsystem (see Section 3.7.3). Any
pages which have never been explicitly written will be read as all zeros. PSRead returns the number
of (non-zero) pages which were actually read. PSIsZero can be used to check whether the specified
range  of  pages,  in  the  given  page  set,  are  all  zero  (unallocated).  If  so,  it  returns  TRUE;  otherwise 
FALSE. 

PSWrite writes multiple (npages) pages from the given primary memory buffer, into page set psid,
beginning  at  page  number  pnum.  The  number  of  pages  actually  written  is  returned.  As  with  PSRead, 
the  buffer  can  be  located  on  either  the  local  or  a  remote  node.  Any  necessary  cross-network  data 
copying  will  be  handled  automatically  by  the  Logical  Disk  Subsystem.  PSZero  provides  a  means  for 
efficiently  "zeroing"  (deallocating)  a  range  of  pages  within  a  page  set.  It  is  especially  useful  for 
"truncating  the  tails"  of  page  sets,  which  is  accomplished  by  zeroing  the  highest  numbered 
(allocated)  pages.  The  number  of  pages  actually  zeroed  (i.e.  the  number  of  previously  allocated 
pages  which  have  now  been  deallocated)  is  returned. 
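
The return-value conventions of PSRead, PSWrite, PSZero, and PSIsZero can be illustrated with a small in-memory model. This is only a sketch: the class and method names are hypothetical, the page set is a Python dictionary rather than disk storage, and the model assumes that "non-zero" pages means allocated pages, so an explicitly written page counts as allocated whatever its contents.

```python
PAGE_SIZE = 2048  # the report's current page size


class StandardPageSetModel:
    """In-memory stand-in for one page set; only allocated pages are stored."""

    def __init__(self):
        self.pages = {}  # pnum -> bytes of length PAGE_SIZE

    def ps_read(self, pnum, npages, buffer):
        # Fill the buffer page by page; unallocated pages read as all zeros.
        allocated = 0
        for i in range(npages):
            page = self.pages.get(pnum + i)
            if page is not None:
                allocated += 1
            buffer[i * PAGE_SIZE:(i + 1) * PAGE_SIZE] = page or bytes(PAGE_SIZE)
        return allocated  # number of allocated pages actually read

    def ps_write(self, pnum, npages, buffer):
        for i in range(npages):
            self.pages[pnum + i] = bytes(buffer[i * PAGE_SIZE:(i + 1) * PAGE_SIZE])
        return npages  # number of pages written

    def ps_zero(self, pnum, npages):
        # Deallocate the range; count only pages that were actually allocated.
        freed = [p for p in range(pnum, pnum + npages) if p in self.pages]
        for p in freed:
            del self.pages[p]
        return len(freed)

    def ps_is_zero(self, pnum, npages):
        return all(p not in self.pages for p in range(pnum, pnum + npages))
```

Truncating the tail of a page set corresponds, in this model, to calling `ps_zero` on the highest numbered allocated pages.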

PSMove  can  be  used  to  "move"  the  specified  range  of  pages  within  page  set  psid,  to  the  new 
location  specified  by  to-pnum.  Overlapping  source  and  destination  ranges  are  permitted,  with  the 
result  being  equivalent  to  first  deallocating  all  of  the  pages  in  the  source  and  destination  regions, 
followed  by  rewriting  the  original  source  pages  into  their  destination  locations.  Among  other  things, 
PSMove  can  be  used  for  "truncating  the  heads"  of  page  sets,  by  specifying  page  zero  as  the 
destination,  and  moving  all  of  the  tail  pages  forward. 
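
The overlap rule for PSMove can be sketched directly from its stated equivalence: deallocate both ranges, then rewrite the original source pages at the destination. The function below is a hypothetical model operating on a dictionary from page number to contents, not the ArchOS code.

```python
def ps_move(pages: dict, pnum: int, npages: int, to_pnum: int) -> None:
    """Move pages [pnum, pnum + npages) so that they start at to_pnum."""
    # Save the allocated source pages, keyed by their offset in the range.
    saved = {i: pages[pnum + i] for i in range(npages) if pnum + i in pages}
    # Deallocate everything in both the source and destination ranges.
    for i in range(npages):
        pages.pop(pnum + i, None)
        pages.pop(to_pnum + i, None)
    # Rewrite the original source pages at their destination locations.
    for i, contents in saved.items():
        pages[to_pnum + i] = contents


# Truncating the head: drop pages 0 and 1 by moving pages 2..4 to page zero.
pages = {i: f"p{i}" for i in range(5)}
ps_move(pages, pnum=2, npages=3, to_pnum=0)
```

After the call the page set holds the old pages 2, 3, and 4 at page numbers 0, 1, and 2, and the former tail positions are deallocated.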

The  PSSync  primitive  ensures  that  any  buffered  information  within  the  Standard  Page  Set 
Subsystem  is  consistent  (synchronized)  with  the  information  on  secondary  memory.  If  a  particular 
page  set  (psid)  is  specified,  only  buffered  information  related  to  that  page  set  is  guaranteed  to  be 
consistent with the secondary memory information. The PSStatus primitive returns (in statusbuffer)


41 

This may require much less time than synchronizing all of the buffered information.




information  about  the  specified  page  set  (psid).  This  includes  the  page  set  type,  its  maximum 
allocated  page  number  (virtual  size),  and  its  actual  number  of  allocated  pages  (physical  size). 

PSRestart  initializes  the  Standard  Page  Set  Subsystem,  for  the  logical  disk  specified  by  disk-aid.  It 
checks to ensure that the main secondary memory data structure (the Standard Page Set B-Tree,
discussed below) is accessible and consistent, and ensures that the pages of any dual page sets are
also consistent, by using the DKRecoverDual primitive of the Logical Disk Subsystem (see Section
3.7.3).  A  side  effect  of  using  DKRecoverDual  is  that  all  of  the  pages  in  the  Standard  Page  Set  B-Tree, 
and  all  of  the  dual  page  set  data  pages,  will  be  marked  as  allocated.  PSRestart  then  marks  (as 
allocated)  all  of  the  data  pages  belonging  to  any  permanent  page  sets  stored  on  the  disk  (see  the 
discussion  of  DKMark  in  Section  3.7.3).  Finally,  PSRestart  “destroys"  any  temporary  page  sets  which 
still  reside  on  the  disk,  by  carefully  removing  their  entries  from  the  Standard  Page  Set  B-Tree. 
PSRestart should only be invoked when the Standard Page Set Subsystem is first created by the
Page  Set  Restart/Reconfiguration  Subsystem  (see  Section  3.7.4  below),  i.e.  when  the  associated 
logical  disk  is  first  added  to  the  system.  The  optional  parameter  (fast)  can  be  used  to  indicate  that  a 
fast  restart  is  in  progress,  and  hence  all  of  the  consistency  checks,  page  allocation  marking,  and 
removal  of  temporary  page  sets  can  be  skipped.  For  more  details  concerning  restart  and  garbage 
collection  activities,  see  Section  3.7.4. 
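
The sequence of restart steps described above can be sketched on a toy model. Everything here is an assumption made for illustration: `DiskModel`, its fields, and the string type tags are hypothetical, and the set unions merely stand in for the marking side effects of DKRecoverDual and DKMark, whose real interfaces are described in Section 3.7.3.

```python
from dataclasses import dataclass, field


@dataclass
class DiskModel:
    btree_node_pages: set   # ldpnums holding B-tree nodes
    entries: dict           # (ps_type, psid, pnum) -> ldpnum
    allocated: set = field(default_factory=set)  # assumed empty before restart


def ps_restart(disk: DiskModel, fast: bool = False) -> bool:
    if fast:
        # Fast restart: skip checks, marking, and temporary page set removal.
        return True
    # Stand-in for DKRecoverDual: B-tree pages and dual data pages get marked.
    disk.allocated |= disk.btree_node_pages
    disk.allocated |= {ld for (t, _, _), ld in disk.entries.items() if t == "DUAL"}
    # Stand-in for DKMark: mark the data pages of permanent page sets.
    disk.allocated |= {ld for (t, _, _), ld in disk.entries.items() if t == "PERMANENT"}
    # Destroy temporary page sets by removing their B-tree entries;
    # their data pages were never marked, so they are implicitly free.
    disk.entries = {k: v for k, v in disk.entries.items() if k[0] != "TEMPORARY"}
    return True
```

The consistency checks themselves are omitted; the sketch only shows which pages end up marked and which entries disappear.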

3.7.1.1 Components of the Standard Page Set Subsystem

Each  instance  of  the  Standard  Page  Set  Subsystem  kernel  arobject  (one  per  logical  disk)  has  the 
simple  structure  illustrated  in  Figure  3-23.  It  consists  of  two  main  components:  a  single  Manager 
process,  and  a  B-Tree  Buffer.  The  Manager  handles  all  of  the  Standard  Page  Set  primitives, 
discussed  above.  It  uses  the  facilities  of  the  associated  Logical  Disk  Subsystem,  in  order  to  store, 
access,  and  manipulate  the  page  sets  recorded  on  its  corresponding  logical  disk.  Note  that  the 
Standard  Page  Set  Subsystem  itself  does  no  buffering  of  data  pages.  It  is  left  to  the  Logical  Disk 
Subsystem  to  transfer  data  directly  into  and  out  of  client  buffers. 

The  B-Tree  Buffer  holds  the  most  recently  accessed  nodes  (pages)  of  the  Standard  Page  Set  B-Tree 
for  this  logical  disk.  The  Standard  Page  Set  B-Tree  is  the  data  structure  used  to  record  the  sets  of 
logical  disk  pages,  comprising  the  standard  page  sets  stored  on  this  disk.  The  use  of  a  B-tree  [Comer 

42 

It is assumed (unless a "fast" restart is in progress) that all of the pages on the logical disk have been "freed" prior to
invoking PSRestart. As a result, the page allocation information must be reconstructed, and until it is, all requests to allocate
new logical disk pages must be avoided. PSRestart must mark (as allocated) all of the pages which are part of the Standard
Page Set Subsystem. Since none of the data pages from temporary page sets are marked, they are automatically freed.
However, the record of the pages allocated to temporary page sets is still stored in the Standard Page Set B-Tree, and must be
removed. The removal of entries from the B-Tree is a "safe" operation, since it does not require the allocation of any new
logical disk pages, although it may involve the freeing of some pages from the B-Tree.




Figure 3-23: Components of the Standard Page Set Subsystem (diagram labels: Reply, Result, Standard Page Set B-Tree Buffer, B-Tree Node)


79]  for  this  purpose  is  quite  similar  to  its  use  in  the  Xerox  Distributed  File  System  (XDFS)  [Sturgis 
80,  Mitchell  82].  The  logical  structure  of  the  Standard  Page  Set  B-Tree  is  illustrated  in  Figure  3-24. 


Figure 3-24: Standard Page Set B-Tree (diagram labels: Root (Interior Node), Interior Nodes, Leaf Nodes)




In essence, the B-tree maintains a sorted list of data pages, ordered by page set ID (psid), and page
number within psid. This allows the mapping from (psid, pnum) to the logical disk page number
(ldpnum) to be performed very quickly. Each node of the B-tree is stored as a dual logical disk
page, in order to improve the reliability of the data structure. In addition to page mapping information,
each node contains a small amount of header information, which indicates the type of node (interior
or leaf), and the number of page map entries. Every map entry, whether in an interior or a leaf node,
has the same structure (psid, pnum, ldpnum). However, the type of page pointed to by ldpnum will
differ, depending on whether the node is interior or a leaf. A pointer to the root node of the B-tree (its
ldpnum) is stored in page zero of the logical disk, so that the Standard Page Set Subsystem can
always find its B-tree data structure.

Note that the type of each page set (temporary, permanent, or dual) is included in the page set ID
(psid). When a page set is first created, an entry for page number zero is added to the B-tree, with
ldpnum set to zero to indicate that the page has not really been allocated yet. As pages are written or
zeroed, appropriate entries are added or removed from the B-tree. However, there will always be an
entry (whether allocated or not) for page zero of each page set that exists. Note that the structure of
the B-tree allows all of the status information for each page set (its type, logical size, and physical
size) to be determined quite easily. It also permits the pages of a page set to be written, read, and
zeroed very efficiently.
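
The ordering that makes these lookups cheap can be illustrated with a flat sorted list, which stands in for the concatenated leaf entries of the B-tree. This is a simplified sketch, not the B-tree itself: the `PageMap` class and its methods are hypothetical, and binary search replaces the descent through interior nodes.

```python
import bisect


class PageMap:
    """Sorted (psid, pnum, ldpnum) entries; a flat stand-in for B-tree leaves."""

    def __init__(self):
        self.entries = []  # kept sorted by (psid, pnum)

    def insert(self, psid, pnum, ldpnum):
        i = bisect.bisect_left(self.entries, (psid, pnum))
        if i < len(self.entries) and self.entries[i][:2] == (psid, pnum):
            self.entries[i] = (psid, pnum, ldpnum)  # replace existing mapping
        else:
            self.entries.insert(i, (psid, pnum, ldpnum))

    def create(self, psid):
        # A new page set gets a page-zero entry with ldpnum 0 (unallocated).
        self.insert(psid, 0, 0)

    def lookup(self, psid, pnum):
        i = bisect.bisect_left(self.entries, (psid, pnum))
        if i < len(self.entries) and self.entries[i][:2] == (psid, pnum):
            return self.entries[i][2]
        return 0  # no entry: the page is unallocated

    def status(self, psid):
        # All entries of one page set are contiguous in the sorted order.
        lo = bisect.bisect_left(self.entries, (psid,))
        hi = bisect.bisect_left(self.entries, (psid + 1,))
        allocated = [pnum for _, pnum, ld in self.entries[lo:hi] if ld != 0]
        # (maximum allocated page number, number of allocated pages)
        return (max(allocated) if allocated else None, len(allocated))
```

Because the entries of a page set are adjacent, the status information reduces to a scan of one contiguous range.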

In terms of storage overhead, the B-tree structure is also quite reasonable. If we assume that psid,
pnum, and ldpnum are each 32 bits (4 bytes) long, then a single B-tree node (2K byte page) can
contain  170  map  entries,  with  8  bytes  remaining  for  header  information.  Since  each  leaf  page  of  the 
B-tree  is  dual,  and  can  map  170  data  pages,  only  a  little  over  one  percent  of  the  logical  disk  pages 
belonging  to  the  Standard  Page  Set  Subsystem  should  be  needed  for  storing  the  B-tree  itself. 
Furthermore, it should be noted that with 170 entries per node, a three level B-tree can map 170^3 data
pages,  which  is  considerably  more  than  can  be  contained  on  any  disk  that  is  likely  to  be  used  as  an 
ArchOS  logical  disk.  Thus,  the  Standard  Page  Set  B-Tree  will  be  a  maximum  of  three  levels  deep. 
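
The figures in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative, and the overhead estimate counts only the leaf level, as the text does.

```python
PAGE_BYTES = 2048        # 2K byte page
ENTRY_BYTES = 3 * 4      # psid, pnum, and ldpnum at 4 bytes each

entries_per_node = PAGE_BYTES // ENTRY_BYTES                 # 170
header_bytes = PAGE_BYTES - entries_per_node * ENTRY_BYTES   # 8

# Each leaf is a dual page (two physical copies) that maps 170 data pages.
leaf_overhead = 2 / entries_per_node                         # about 1.18 percent

# Three levels (root, interior, leaf) can address 170 cubed data pages.
max_data_pages = entries_per_node ** 3                       # 4,913,000
max_bytes = max_data_pages * PAGE_BYTES                      # roughly 10 gigabytes
```

Interior nodes add a further overhead that is smaller than the leaf overhead by a factor of about 170, which is why the total stays only a little over one percent.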

A  special  note  should  be  made  concerning  the  Standard  Page  Set  B-Tree  Buffer,  illustrated  earlier 
in  Figure  3-23.  Although  (for  clarity  of  exposition)  each  instance  of  the  Standard  Page  Set  Subsystem 
is  shown  as  having  its  own  buffer  area,  in  practice,  all  instances  on  a  single  node  would  share  a 
common  Kernel  Page  Buffer  Pool.  Indeed,  the  Kernel  Page  Buffer  Pool  would  be  shared  with  many 


43

Page zero of each logical disk is a special dual page, which contains pointers to all of the major data structures on the
disk. It also contains information about the disk itself, such as its size and its logical disk ID.




other  subsystems  on  that  node  as  well:  Logical  Disk  Subsystems,  Atomic  Page  Set  Subsystems,  File 
System  Disk  Directory  Management  Subsystems,  and  so  on.  Each  subsystem  allocates  pages  from 
the  Kernel  Page  Buffer  Pool  (in  least  recently  used  order)  as  required.  Pages  are  then  returned  to  the 
pool  (freed)  as  soon  as  they  are  no  longer  being  actively  used.  Each  subsystem  maintains  a  list  of  the 
“free"  pages  in  the  buffer  pool,  which  it  has  used  recently.  This  list  provides  "hints"  about  pages  in 
the  buffer  pool  which  may  still  contain  information  of  use  to  the  subsystem,  i.e.  they  are  buffer  pages 
which can be reclaimed, rather than reading the information again from disk, assuming the pages
have  not  already  been  reused  by  some  other  subsystem. 
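
The hint mechanism can be sketched as follows. This is one possible reading of the description, not the ArchOS design: the class, its methods, and the use of an (owner, tag) pair to identify a frame's last contents are assumptions.

```python
from collections import OrderedDict


class KernelPageBufferPool:
    def __init__(self, nframes):
        # Free frames in least-recently-used order: frame -> last contents.
        self.free = OrderedDict((f, None) for f in range(nframes))

    def allocate(self):
        # Take the least recently used free frame; its old contents are lost.
        frame, _ = self.free.popitem(last=False)
        return frame

    def release(self, frame, owner, tag):
        # Return the frame to the pool, remembering what it still holds.
        self.free[frame] = (owner, tag)

    def reclaim(self, frame, owner, tag):
        # A hint is valid only if no other subsystem has reused the frame.
        if self.free.get(frame) == (owner, tag):
            del self.free[frame]
            return True
        return False


pool = KernelPageBufferPool(nframes=2)
frame = pool.allocate()
pool.release(frame, owner="SPS", tag=("btree-node", 17))
hints = {("btree-node", 17): frame}  # the subsystem's private hint list
```

A subsystem that finds a tag in its hint list calls `reclaim` first and falls back to `allocate` plus a disk read only when the hint has gone stale.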

3.7.2  Atomic  Page  Set  Subsystem 

The  Atomic  Page  Set  Subsystem  is  quite  similar  to  the  Standard  Page  Set  Subsystem,  except  that  it 
provides  the  "atomic  page  set"  abstraction.  The  primary  difference  between  the  standard  and  atomic 
page  set  abstractions  is  that  the  consistency  of  atomic  page  sets  will  be  maintained  in  the  event  of  a 
crash.  In  addition,  the  Atomic  Page  Set  Subsystem  allows  arbitrary  sequences  of  bytes  to  be  read  or 
written, rather than restricting the operations to being page-oriented. Atomic page sets are primarily
used  for  storing  atomic  abstract  data  type  instances,  such  as  atomic  files.  As  with  the  Standard  Page 
Set  Subsystem,  an  instance  of  the  Atomic  Page  Set  Subsystem  is  created  for  each  logical  disk  in  the 
distributed  computer  system. 

The  Atomic  Page  Set  Subsystem  provides  primitives  for  "committing”  arbitrarily  nested  elementary 
and  compound  transactions,  in  two  phases:  the  prepare  phase,  and  the  commit  phase.  These  two 
phases  can  be  used  by  the  Transaction  Management  Subsystem,  as  part  of  its  three  phase  commit 
protocol  (see  Section  3.5).  Using  the  mechanisms  provided  here,  it  is  possible  to  atomically  commit 
(or  to  abort)  transactions  which  span  multiple  logical  disks,  and  multiple  nodes.  It  should  be  noted 
that  the  Atomic  Page  Set  Subsystem  only  provides  support  for  the  properties  of  failure  atomicity,  and 
durability [Eswaran 76, Gray 81]. The sequencing and scheduling of concurrent transactions, so as
to  ensure  consistency,  is  assumed  to  be  handled  by  the  Transaction  Management  Subsystem.  The 
actual  primitives  provided  by  the  Atomic  Page  Set  Subsystem  are  the  following: 




psid = APSCreate(tid)
val = APSDestroy(tid, psid)
nbr = APSRead(tid, psid, location, nbytes, buffer)
val = APSIsZero(tid, psid, location, nbytes)
nbw = APSWrite(tid, psid, location, nbytes, buffer)
nbw = APSZero(tid, psid, location, nbytes)
val = APSMove(tid, psid, location, nbytes, to-location)
val = APSPrepareCommit(tid)
val = APSCommit(tid)
val = APSAbort(tid)
val = APSStatus(psid, statusbuffer)
val = APSRestart(disk-aid [, fast])

PSID psid  The identifier for the atomic page set (unique within the logical disk). It includes
an indication of the page set type (ATOMIC).

BOOLEAN val  TRUE if the specified operation is completed successfully; otherwise FALSE.

INT nbr, nbw  The actual number of bytes read or written.

TID tid  The ID of the transaction to which this operation belongs. From the tid it is
possible to determine the transaction type, as well as the parent transaction ID (if
this is a nested transaction).

INT location  The starting location (byte offset from the beginning of the atomic page set) for
the specified operation.

INT nbytes  The number of bytes involved in the specified operation.

BUFFER *buffer  The buffer address in the kernel address space of either the local or remote node.

INT to-location  The destination location, to which the specified bytes are to be moved.

APSSTATUSBUFFER *statusbuffer
The buffer address for returning status information (can be either a local or
remote kernel address).

AID disk-aid  The arobject ID of the Logical Disk Subsystem, associated with the disk to be
used by this instance of the Atomic Page Set Subsystem.

BOOLEAN fast  TRUE if this is to be a fast restart, avoiding all of the disk checking and garbage
collection; otherwise FALSE. The default is FALSE.




The  APSCreate  primitive  is  used  to  create  a  new  atomic  page  set,  on  the  disk  associated  with  this 
instance of the Atomic Page Set Subsystem. A transaction ID (tid) is specified, so that the new atomic
page set will only come into permanent existence when transaction tid commits. Until then, the new
page set is only visible within this transaction, or within any nested subtransactions of tid. APSCreate
returns the identifier for the new atomic page set (psid), which is unique within this logical disk.
Initially, no pages are actually allocated to the page set, i.e. reading any part of it will return all zeros.
The APSDestroy primitive frees all pages belonging to the specified atomic page set (psid), and
removes  all  record  of  that  page  set  from  the  disk.  As  was  the  case  with  APSCreate,  the  specified 
transaction  ID  (tid)  determines  when  the  effects  of  APSDestroy  will  be  permanently  committed,  as  well 
as  the  scope  of  their  visibility  in  the  interim. 

APSRead  reads  nbytes  bytes  into  the  given  buffer,  beginning  from  byte  location  in  atomic  page  set 
psid. The transaction ID (tid) is taken into account when determining which modifications (if any) to
the  specified  bytes,  by  transactions  which  have  not  yet  committed,  should  be  visible  to  this  read 
operation.  If  the  buffer  is  located  on  a  remote  node,  the  data  will  be  automatically  copied  across  the 
network,  using  the  special  kernel  Copy  primitive.  Any  bytes  which  have  never  been  explicitly  written 
will  be  read  as  zeros.  APSRead  returns  the  number  of  (non-zero)  bytes  which  were  actually  read. 
APSIsZero can be used to check whether the specified range of bytes, in the given atomic page set,
are  all  zero.  If  so,  it  returns  TRUE;  otherwise  FALSE.  The  specified  transaction  ID  serves  the  same 
purpose  as  in  the  case  of  APSRead. 

APSWrite writes nbytes bytes from the given buffer, into atomic page set psid, beginning at byte
location. The number of bytes actually written is returned. The specified transaction ID (tid)
determines when the effects of APSWrite will be permanently committed, as well as the scope of their
visibility in the interim. As with APSRead, the buffer can be located on either the local or a remote
node. Any necessary cross-network data copying will be handled automatically, using the special
kernel  Copy  primitive.  APSZero  provides  a  means  for  efficiently  “zeroing”  a  range  of  bytes  within  an 
atomic  page  set.  The  specified  transaction  ID  serves  the  same  purpose  as  in  the  case  of  APSWrite. 
When  complete  pages  are  zeroed,  they  are  removed  from  the  page  set  (deallocated).  APSZero  is 
especially  useful  for  "truncating  the  tails”  of  atomic  page  sets,  which  is  accomplished  by  zeroing  the 
highest  numbered  (non-zero)  bytes.  The  number  of  bytes  actually  zeroed  (i.e.  the  number  of 
previously  non-zero  bytes  which  have  now  been  zeroed)  is  returned. 
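
The byte-oriented zeroing rule can be sketched on a dictionary of pages. This model ignores the transaction ID entirely and is not the ArchOS implementation; the function name is hypothetical, and it assumes that any page left entirely zero is deallocated, which is slightly broader than the "complete pages are zeroed" case stated above.

```python
PAGE_SIZE = 2048  # the report's current page size


def aps_zero(pages: dict, location: int, nbytes: int) -> int:
    """Zero bytes [location, location + nbytes); return how many were non-zero.

    pages maps page number -> bytearray of length PAGE_SIZE.
    """
    zeroed = 0
    end = location + nbytes
    first_page, last_page = location // PAGE_SIZE, (end - 1) // PAGE_SIZE
    for pnum in range(first_page, last_page + 1):
        page = pages.get(pnum)
        if page is None:
            continue  # unallocated pages already read as zeros
        base = pnum * PAGE_SIZE
        lo = max(location, base) - base
        hi = min(end, base + PAGE_SIZE) - base
        zeroed += sum(1 for b in page[lo:hi] if b != 0)
        page[lo:hi] = bytes(hi - lo)
        if not any(page):
            del pages[pnum]  # a completely zero page is deallocated
    return zeroed
```

Truncating the tail of an atomic page set then amounts to zeroing from the new end of data up to the highest non-zero byte.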


44

If tid is a nested elementary transaction, the new atomic page set will only come into permanent existence when the
top-level parent of tid commits.




APSMove  can  be  used  to  “move"  the  specified  range  of  bytes  within  atomic  page  set  psid,  to  the 
new location specified by to-location. The transaction ID (tid) serves the same purpose as in the case
of APSWrite and APSZero. Overlapping source and destination ranges are permitted, with the result
being equivalent to first zeroing all of the bytes in the source and destination regions, followed by
rewriting  the  original  source  bytes  into  their  destination  locations.  Among  other  things,  APSMove  can 
be  used  for  "truncating  the  heads"  of  atomic  page  sets,  by  specifying  location  zero  as  the 
destination,  and  moving  all  of  the  tail  bytes  forward. 

The  APSPrepareCommit  primitive  is  used  in  phase  one  of  the  two  phase  transaction  commit 
sequence  for  atomic  page  sets.  All  modifications  to  atomic  page  sets  on  this  logical  disk,  which  were 
made  in  the  course  of  the  specified  transaction  (tid),  are  carefully  recorded  in  a  special  “Commit  List” 
on  secondary  memory.  These  modifications  can  no  longer  be  lost  in  the  event  of  a  crash. 
Transaction tid is flagged (on the disk) as "prepared to commit". The invoker is then informed
(through the return value, val), of the Atomic Page Set Subsystem's readiness to commit the
transaction.  Following  successful  completion  of  the  APSPrepareCommit  primitive,  the  only  allowed 
operation  involving  transaction  tid  is  APSCommit  or  APSAbort.  Note  that  APSPrepareCommit  need 
not  be  invoked  in  the  case  of  a  nested  elementary  subtransaction  (it  has  no  effect).  Such  a 
subtransaction  does  not  actually  modify  the  permanent  contents  of  the  atomic  page  sets,  until  its  top 
level  parent  transaction  commits.  Hence,  a  two  phase  commit  sequence  is  unnecessary,  since  any 
failure  will  cause  the  parent  transaction,  as  well  as  this  “committed"  subtransaction,  to  be  aborted. 

APSCommit is used in phase two of the two phase transaction commit sequence. It is also the only
primitive required to "commit" a nested elementary subtransaction. In this latter case, APSCommit
does not actually modify the permanent contents of the atomic page sets. Instead, it simply makes all
of the modifications which were made in the course of the specified transaction (tid) visible to the
parent of tid. When APSCommit is applied to a top level elementary transaction, or to any compound
transaction, the specified transaction (tid) must already have been "prepared" (with
APSPrepareCommit), in phase one of the commit sequence. In this case APSCommit is the second
(final) phase of the commit sequence, and it is responsible for carefully updating the "permanent"
contents of the atomic page sets, based on the modification information contained in the Commit List.
APSCommit first flags (on the disk) transaction tid as "committed". It then returns the result (val) to
the invoker, allowing the invoker to continue its execution while the Atomic Page Set Subsystem
makes the required modifications to the atomic page sets.

The APSAbort primitive is used to abort the specified transaction (tid). This involves removing
(undoing) all modifications made within transaction tid, or any of its nested subtransactions (except
already committed compound subtransactions). Note that transactions which have been "prepared",
but not yet committed, can still be aborted. In that case the aborted modifications must be carefully
removed from the Commit List.

The  APSStatus  primitive  returns  (in  statusbuffer)  information  about  the  specified  atomic  page  set 
(psid). This includes its virtual size (maximum non-zero byte location), physical size (actual number of
allocated  bytes),  and  an  indication  of  which  uncommitted  transactions,  if  any,  are  still  operating  on 
this  page  set.  Note  that  since  no  transaction  ID  is  specified  with  the  APSStatus  primitive,  the  sizes 
reported  are  those  of  the  current,  "permanent"  version  of  the  atomic  page  set. 

APSRestart initializes the Atomic Page Set Subsystem for the logical disk specified by disk-id. It
checks to ensure that the main secondary memory data structures (the Atomic Page Set Permanent
and Commit B-Trees, discussed below) are accessible and consistent, by using the DKRecoverDual
primitive of the Logical Disk Subsystem (see Section 3.7.3). A side effect of using DKRecoverDual is
that all of the pages in the Permanent and Commit B-Trees will be marked as allocated. APSRestart
then marks (as allocated) all of the data pages belonging to any atomic page sets stored on the disk,
and also marks all of the data pages associated with the Commit List (see the discussion of DKMark in
Section 3.7.3).^45 Finally, APSRestart completes the processing of any transaction which is in the
Commit List, and flagged as "committed". This involves carefully updating the permanent contents of
the atomic page sets, based on the modification information contained in the Commit List.^46
APSRestart should only be invoked when the Atomic Page Set Subsystem is first created by the
Page Set Restart/Reconfiguration Subsystem (see Section 3.7.4 below), i.e. when the associated
logical disk is first added to the system. The optional parameter (fast) can be used to indicate that a
fast restart is in progress, and hence all of the consistency checks and page allocation marking can
be skipped. However, any remaining commit processing must still be performed, even in the case of a
fast restart. For more details concerning restart and garbage collection activities, see Section 3.7.4.

^45 At this point, since PSRestart is assumed to have been invoked prior to APSRestart, all of the page
allocation information for the logical disk has been reconstructed. Hence, it is now safe to allocate new
logical disk pages, as required, without danger of reallocating pages that are already in use.

^46 Note that commit processing can be done "in the background" (after returning the result val to the
invoker), just as in the case of APSCommit.

3.7.2.1 Components of the Atomic Page Set Subsystem

Each instance of the Atomic Page Set Subsystem kernel arobject (one per logical disk) has the
general structure illustrated in Figure 3-25. It consists of five main components: a single Manager
process, three B-Tree Buffers, and a Data Page Buffer. The Manager handles all of the Atomic Page
Set primitives, discussed above. It uses the facilities of the associated Logical Disk Subsystem, in
order to store, access, and manipulate the page sets recorded on its corresponding logical disk.


Figure 3-25: Components of the Atomic Page Set Subsystem
(Diagram not reproduced. Legible labels: Reply, Result, Atomic Page Set Commit B-Tree Buffer,
Atomic Page Set Modification B-Tree Buffer, Data Page.)


The four buffers in Figure 3-25 are primarily shown for clarity of exposition. In practice (as
explained in Section 3.7.1), each instance of the Atomic Page Set Subsystem would not have its own,
separate buffer areas. Instead, it would allocate buffer pages from a common Kernel Page Buffer
Pool, and maintain "hints" lists concerning those pages which it most recently returned to the pool.
For our purposes, however, it is more convenient (and not entirely inaccurate) to regard each
instance of the Atomic Page Set Subsystem as having its own buffer areas.

The Permanent B-Tree Buffer holds the most recently accessed nodes (pages) of the Atomic Page
Set Permanent B-Tree for this logical disk. The Permanent B-Tree is the data structure used to
record the sets of logical disk pages comprising the current, permanent versions of the atomic page
sets stored on this disk. The logical structure of the Permanent B-Tree is identical to that of the
Standard Page Set B-Tree, as discussed at length in Section 3.7.1, and illustrated in Figure 3-24.
Refer to that Section for details. However, updating the Permanent B-Tree must be done more
carefully than in the case of the Standard Page Set B-Tree, since in this case the update must be
done atomically with respect to system failures. The technique for carefully (atomically) updating the
Permanent B-Tree is outlined briefly in the discussion of transaction commit handling, below. As with
the Standard Page Set B-Tree, a pointer to the root node of the Permanent B-Tree (its ldpnum) is
stored in page zero of the logical disk, so that the Atomic Page Set Subsystem can always find it.




The  Modification  B-Tree  Buffer  holds  the  most  recently  accessed  pages  of  the  Atomic  Page  Set 
Modification  B-Tree.  The  Modification  B-Tree  is  a  temporary  data  structure,  which  contains  the  list  of 
modifications  that  have  been  made  to  atomic  page  sets,  in  the  course  of  the  currently  active, 
uncommitted  transactions.  Although  the  modification  list  is  usually  expected  to  be  quite  short, 
storing  it  as  a  B-tree  will  allow  it  to  grow  arbitrarily  large,  when  needed.  The  general  structure  of  the 
Modification  B-Tree  is  similar  to  that  of  the  Permanent  B-Tree  and  the  Standard  Page  Set  B-Tree  (see 
Figure  3-24).  However,  the  nodes  of  the  tree  are  stored  in  normal  (rather  than  dual)  pages,  since  this 
is  a  temporary  data  structure  that  will  be  “thrown  away”  automatically  following  a  crash.  The 
structure  of  a  leaf  node  of  the  Modification  B-Tree,  along  with  its  associated  data  pages,  is  illustrated 
in  Figure  3-26. 


The Modification List (B-Tree) is ordered by (tid, psid, pnum), and allows the mapping from these
triples to the associated page modifications to be performed very quickly. Page modifications are
represented by a Data Page, containing the new values for the modified bytes within the page, and a
Mask Page, which indicates which bytes have been modified. Modified bytes are indicated by a
corresponding Mask Page byte with hexadecimal value FF. Unmodified bytes have value zero in both
the Data Page and the Mask Page.^47 If the entire Data Page has been modified, then the Mask Page
can be omitted (mask ldpnum = 0). If the entire Data Page has been zeroed, then the Data Page can
also be omitted (data ldpnum = 0). One other special case is that of atomic page sets that have been
destroyed. In this case the Modification List contains only an entry for page zero of the destroyed
page set, and mask ldpnum = data ldpnum = DESTROYED. Use of the Modification List when
accessing and modifying the atomic page sets, and when committing or aborting transactions, is
discussed below in Sections 3.7.2.2 and 3.7.2.3.

Another major data structure of the Atomic Page Set Subsystem, illustrated earlier in Figure 3-25, is
the Commit B-Tree Buffer. It holds the most recently accessed pages of the Atomic Page Set Commit
B-Tree, which is the data structure used to record the Commit List (the list of atomic page set
modifications made by transactions which are in the process of being committed). As with the
Modification List, the Commit List is usually expected to be quite short. However, storing it as a B-tree
will allow it to grow arbitrarily large, when needed. The structure of the Commit B-Tree is almost
identical to that of the Modification B-Tree, except that the nodes of the tree are stored in dual (rather
than normal) pages. See Figure 3-26 above for an illustration of the structure of a leaf node of the
Commit B-Tree. Updating the Commit B-Tree must be done carefully (i.e., atomically with respect to
system failures), just as in the case of updating the Permanent B-Tree. This is to avoid any possible
corruption or accidental loss of transactions after they have been flagged as "prepared to commit".
A pointer to the root node of the Commit B-Tree is stored in page zero of the logical disk, so that the
Atomic Page Set Subsystem can always find it at restart time.

The  final  major  component  of  the  Atomic  Page  Set  Subsystem  is  the  Data  Page  Buffer.  Unlike  the 
Standard  Page  Set  Subsystem,  the  Atomic  Page  Set  Subsystem  must  often  manipulate  subparts  (byte 
ranges)  within  data  pages.  The  relevant  pages  are  held  in  the  Data  Page  Buffer  while  they  are  being 
operated  on.  Similarly,  modification  Mask  Pages  are  constructed,  accessed,  and  modified  by  the 
Atomic  Page  Set  Subsystem,  using  the  Data  Page  Buffer  for  temporary  storage. 

^47 Although a bit map could be used in place of the current (byte map) Mask Page, the page oriented
nature of the logical disk on which the maps are stored makes the current design more convenient. In
addition, byte manipulations are usually more convenient and efficient than bit manipulations, in most
current programming languages and hardware architectures.

3.7.2.2 Accessing and Modifying Atomic Page Sets

Any access or modification to an Atomic Page Set must be done in the context of a transaction,
specified by tid. All modifications made in the course of a particular transaction are first recorded
(temporarily) in the Modification List. For example, assume that the following invocation has been
made:

nbw = APSWrite(tid, psid, location, nbytes, buffer)

Then the Modification List will be searched for (tid, psid, pnum), where pnum is the page containing
the bytes specified by location and nbytes.^48 If not found, a new entry with that key will be inserted in
the list, with pointers to newly allocated (and completely zeroed) Mask and Data Pages. In either
case, the specified bytes of the Data Page are set to the contents of buffer, and the corresponding
bytes of the Mask Page are "turned on" (set to hexadecimal FF).
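
As a rough illustration of this write path, the sketch below keeps the Modification List in an ordinary Python dict keyed by (tid, psid, pnum) rather than in a B-tree, and handles only writes that stay within a single page. The names aps_write and modification_list are assumptions made for the example, and nbytes is taken from the length of the buffer.

```python
PAGE_SIZE = 2048  # the report defines a page as (currently) 2K bytes

# Modification List stand-in: (tid, psid, pnum) -> [data_page, mask_page]
modification_list: dict[tuple[int, int, int], list[bytearray]] = {}


def aps_write(tid: int, psid: int, location: int, buffer: bytes) -> int:
    """Record a single-page write; returns the number of bytes written."""
    pnum, offset = divmod(location, PAGE_SIZE)
    if offset + len(buffer) > PAGE_SIZE:
        raise ValueError("page-boundary crossing is not handled in this sketch")
    key = (tid, psid, pnum)
    if key not in modification_list:
        # New entry: freshly allocated, completely zeroed Data and Mask Pages.
        modification_list[key] = [bytearray(PAGE_SIZE), bytearray(PAGE_SIZE)]
    data_page, mask_page = modification_list[key]
    data_page[offset:offset + len(buffer)] = buffer
    mask_page[offset:offset + len(buffer)] = b"\xff" * len(buffer)  # turn the mask bytes on
    return len(buffer)


nbw = aps_write(tid=7, psid=1, location=PAGE_SIZE + 4, buffer=b"abc")
```

A second write to the same page by the same transaction reuses the existing entry, so the mask accumulates every byte range the transaction has touched.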

Accessing  the  current  contents  of  an  atomic  page  set  is  a  little  more  involved.  For  example,  assume 
that  the  following  invocation  has  been  made: 

nbr  =  APSRead(tid,  psid,  location,  nbytes,  buffer) 

Then  tid  must  be  taken  into  account  when  determining  which  modifications  (if  any)  have  been  made 
to  the  specified  bytes,  and  should  be  visible  within  this  transaction.  The  Modification  List  is  first 
searched  for  (tid,  psid,  pnum),  where  pnum  is  the  page  containing  the  bytes  specified  by  location  and 
nbytes.  The  Modification  List  is  then  successively  searched  for  the  ancestor  transactions  of  tid,  to 
see  if  any  of  them  have  modified  the  page  in  question.  As  relevant  entries  are  found,  a  Compound 
Mask  and  Compound  Data  Page  are  constructed,  such  that  the  modifications  made  in 
subtransactions  take  precedence  over  their  ancestors.  This  can  be  easily  accomplished  (at  least 
conceptually)  using  bitwise  logical  operations  on  entire  Mask  and  Data  Pages.  For  example,  to  merge 
another  (ancestor)  Data  Page  (Data)  and  Mask  Page  (Mask)  into  the  Compound  Data  and  Mask  Pages 
that  have  been  constructed  thus  far,  the  following  equations  can  be  used: 

CompoundData = CompoundData OR (Data AND (NOT CompoundMask))

CompoundMask = CompoundMask OR Mask

If,  at  any  point  in  the  construction  of  the  Compound  Data  and  Compound  Mask  Pages,  it  is  found 
that  the  entire  page  has  been  modified,  then  there  is  no  need  to  search  any  further.  The  requested 
bytes  can  be  obtained  directly  from  the  Compound  Data  Page,  and  returned  in  buffer.  Otherwise,  the 
modifications indicated by the Compound Mask and Compound Data Pages (if any) must be applied
(conceptually) to the corresponding page in the Atomic Page Set Permanent B-Tree. This is done
using the same formula as shown above, where in this case "Data" is the Permanent Data Page. The
requested bytes can then be obtained directly from the Compound Data Page (as before), and returned
in buffer.
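
The read procedure above can be sketched as follows, under several simplifying assumptions: the Modification List and the permanent pages are plain dicts, the transaction hierarchy is a hypothetical parent_of mapping, and a tiny page size is used so the example is easy to follow. The merge function applies the two equations byte by byte, updating the data before the mask.

```python
PAGE_SIZE = 8  # tiny page for illustration; the report uses 2K-byte pages


def merge(comp_data: bytearray, comp_mask: bytearray, data: bytes, mask: bytes) -> None:
    """Merge an ancestor's Data/Mask Pages into the compound pages, in place."""
    for i in range(PAGE_SIZE):
        comp_data[i] |= data[i] & ~comp_mask[i] & 0xFF  # uses the old compound mask
        comp_mask[i] |= mask[i]


def read_page(tid, psid, pnum, mod_list, parent_of, permanent) -> bytes:
    """Return the page contents visible to transaction tid."""
    comp_data, comp_mask = bytearray(PAGE_SIZE), bytearray(PAGE_SIZE)
    t = tid
    while t is not None:                       # tid first, then its ancestors
        entry = mod_list.get((t, psid, pnum))
        if entry is not None:
            merge(comp_data, comp_mask, *entry)
            if all(m == 0xFF for m in comp_mask):
                return bytes(comp_data)        # whole page modified: stop searching
        t = parent_of.get(t)
    # Apply the compound pages to the permanent page, treated as fully "on".
    perm = permanent.get((psid, pnum), bytes(PAGE_SIZE))
    merge(comp_data, comp_mask, perm, b"\xff" * PAGE_SIZE)
    return bytes(comp_data)


permanent = {(1, 0): b"PPPPPPPP"}
parent_of = {2: 1, 1: None}                    # transaction 2 is nested inside 1
mod_list = {
    (1, 1, 0): (b"AA" + bytes(6), b"\xff\xff" + bytes(6)),
    (2, 1, 0): (b"\x00B" + bytes(6), b"\x00\xff" + bytes(6)),
}
seen_by_child = read_page(2, 1, 0, mod_list, parent_of, permanent)
seen_by_parent = read_page(1, 1, 0, mod_list, parent_of, permanent)
```

The child sees its own byte where it conflicts with the parent, the parent's byte where only the parent wrote, and the permanent contents everywhere else.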

Note that, because of the buffering provided in the Atomic Page Set Subsystem, most manipulations
of the Modification List, including Mask and Data Page manipulations, will not require actual disk
operations. Furthermore, page-at-a-time access to atomic page sets will usually be much more
efficient than manipulating a small number of bytes with each operation, and it can be made even
more efficient by treating it as a special case.

^48 To simplify our examples, we will omit the details concerning the handling of multiple pages, and the
crossing of page boundaries.

3.7.2.3 Transaction Commit and Abort Handling

All modifications to atomic page sets are first made on a temporary basis, by recording them in the
Modification List, as outlined above. A modification will only become "permanent" after its
associated transaction (tid) has been committed. In the case of a nested elementary subtransaction,
a modification can only become permanent after the top level ancestor transaction commits. This is
because any failure (or abort) of an ancestor transaction will cause the modifications made in any
elementary subtransactions to also be aborted. When a nested elementary subtransaction (tid)
commits, all of its modifications become visible to its parent transaction (parent-tid). This is
accomplished by simply "renaming" all of the tid entries in the Modification List to have transaction
ID parent-tid. In case of conflicts, the modifications made in tid take precedence over the
modifications made in parent-tid. This is done by constructing a Compound Mask and Compound
Data Page, using the same equations as discussed above.
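
The renaming step for a committing nested elementary subtransaction might look like the sketch below. It again uses a dict keyed by (tid, psid, pnum) as a stand-in for the Modification List, with (data, mask) byte strings as values; commit_subtransaction is a hypothetical name chosen for the example.

```python
def commit_subtransaction(mod_list: dict, tid: int, parent_tid: int) -> None:
    """Rename tid's entries to parent_tid; on conflict the child's bytes win."""
    for key in [k for k in mod_list if k[0] == tid]:
        child_data, child_mask = mod_list.pop(key)
        parent_key = (parent_tid,) + key[1:]
        if parent_key in mod_list:
            par_data, par_mask = mod_list[parent_key]
            # Same equations as for reads, with the child as the compound pages.
            child_data = bytes(c | (p & ~m & 0xFF)
                               for c, p, m in zip(child_data, par_data, child_mask))
            child_mask = bytes(m | pm for m, pm in zip(child_mask, par_mask))
        mod_list[parent_key] = (child_data, child_mask)


mod_list = {
    (1, 9, 0): (b"AA\x00", b"\xff\xff\x00"),   # parent wrote bytes 0 and 1
    (2, 9, 0): (b"\x00BB", b"\x00\xff\xff"),   # child wrote bytes 1 and 2
    (2, 9, 1): (b"C\x00\x00", b"\xff\x00\x00"),
}
commit_subtransaction(mod_list, tid=2, parent_tid=1)
```

After the call no entries remain under the child's transaction ID, and the conflicting byte 1 of page 0 holds the child's value.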

Committing  a  top  level  elementary  transaction,  or  any  compound  transaction  (whether  nested  or 
not)  causes  all  of  its  modifications  to  be  permanently  made  to  the  affected  atomic  page  sets.  These 
modifications  must  be  made  very  carefully,  so  as  to  preserve  the  atomic  properties  of  the  page  sets 
(failure  atomicity  and  durability).  Since  a  transaction  may  span  more  than  one  logical  disk,  and  more 
than  one  computing  node,  a  two  phase  commit  sequence  is  needed  to  coordinate  the  atomic  updates 
of  the  various  Atomic  Page  Set  Subsystems,  and  to  make  them  all  occur  as  a  single  atomic  update.  In 
the  first  phase  of  the  transaction  commit  sequence,  the  APSPrepareCommit  primitive  is  invoked  on 
each  of  the  Atomic  Page  Set  Subsystems  involved  in  transaction  tid.  In  response  to 
APSPrepareCommit,  each  Atomic  Page  Set  Subsystem  must  carefully  record,  in  its  Commit  List,  all  of 
the  modifications  for  transaction  tid,  so  that  they  can  no  longer  be  lost  in  the  event  of  a  crash.  This 
basically  involves  moving  all  of  the  entries  with  transaction  ID  tid,  from  the  Modification  List  to  the 
Commit  List.  However,  to  avoid  inconsistencies  in  the  Commit  List,  and  the  attendant  loss  of 
previously  committed  transactions,  any  updates  to  the  Commit  List  must  be  done  atomically  with 
respect  to  system  failures.  The  technique  for  atomically  updating  the  Commit  List  B-tree  is  the  same 
as  that  outlined  below,  for  the  case  of  the  Atomic  Page  Set  Permanent  B-Tree. 

After all of the Atomic Page Set Subsystems have indicated that they are prepared to commit
transaction tid, the second phase of the commit sequence can begin. Phase two is signalled to each
of the Atomic Page Set Subsystems by invoking the APSCommit primitive. In response to
APSCommit, each subsystem simply flags transaction tid as committed, and replies that it has
completed the request. This allows the invoker to continue its execution, while the Atomic Page Set
Subsystem actually carries out the required commit processing. Flagging a transaction as committed
is accomplished by writing its transaction ID (tid) into the header portion of the root node of the
Commit List B-Tree. Note that since the nodes of the B-tree are stored as dual logical disk pages,
updating the root node on disk will be an atomic operation.

Once a transaction has been flagged as committed, an Atomic Page Set Subsystem will not accept
any other requests for operations on atomic page sets, until it has completed the necessary commit
processing. Committing transaction tid basically requires carefully (atomically) updating the
"permanent" contents of the atomic page sets (as stored in the Atomic Page Set Permanent B-Tree),
based on the modification information contained in the Commit List. The possible types of
modifications are: (1) the insertion of new pages (including entirely new page sets); (2) the
modification of existing pages; (3) the removal (zeroing) of existing pages; and (4) the removal
(destruction) of entire page sets. The key to making all of the modifications for transaction tid occur
atomically is to never actually modify any existing node or data pages in the Permanent B-Tree
(except one). Instead new pages, reflecting the required modifications, are constructed as
necessary. Then, all of the modifications can be made permanent at the same time, by atomically
updating a single Permanent B-tree node: the node at the root of the (minimum) subtree which
encompasses all of the modifications. For more details concerning this technique, see [Mitchell 82],
where the XDFS approach to atomically updating a file system B-tree is discussed.

After  the  Permanent  B-Tree  has  been  atomically  updated,  the  modifications  for  transaction  tid  must 
be  carefully  removed  from  the  Commit  List,  and  tid  erased  from  the  header  of  the  Commit  List  root 
node.  This  updating  of  the  Commit  List  can  all  be  done  atomically,  using  the  same  technique  as 
outlined  above  for  the  Permanent  B-Tree.  Note,  however,  that  there  is  a  period  of  time  between  the 
atomic  updating  of  the  Permanent  B-Tree  and  the  atomic  updating  of  the  Commit  List,  during  which  a 
system  failure  could  occur.  If  a  failure  occurs  during  that  time,  the  Atomic  Page  Set  Subsystem  will 
automatically  (upon  system  restart)  attempt  to  apply  the  modifications  for  transaction  tid  once  again. 
This  will  not  cause  any  problems,  however,  since  all  of  the  possible  modifications  to  the  Permanent 
B-Tree,  as  listed  earlier,  are  idempotent  (i.e.  they  can  be  repeated  multiple  times  without  changing  the 
final  result).  When  at  last  the  Atomic  Page  Set  Subsystem  is  able  to  complete  the  commit  processing 
for  transaction  tid,  it  can  then  return  to  accepting  new  requests  for  operations  on  atomic  page  sets. 
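
The reason a repeated application is harmless can be seen in a small simulation. The sketch below is only a model: the "disk" is a Python dict, each Commit List entry replaces a whole page (masks and the shadow-page B-tree update are left out), and the crash_before_cleanup switch is an invented test hook that simulates a failure between the update of the permanent contents and the cleanup of the Commit List.

```python
def apply_committed(disk: dict) -> None:
    """Apply the Commit List entries of the flagged transaction, then clear them."""
    tid = disk["committed_tid"]
    if tid is None:
        return                                       # nothing is flagged as committed
    for (t, psid, pnum), page in disk["commit_list"].items():
        if t == tid:
            disk["permanent"][(psid, pnum)] = page   # idempotent: repeating changes nothing
    if disk.get("crash_before_cleanup"):
        disk["crash_before_cleanup"] = False
        raise RuntimeError("simulated crash")
    disk["commit_list"] = {k: v for k, v in disk["commit_list"].items() if k[0] != tid}
    disk["committed_tid"] = None                     # erase tid from the root header


disk = {
    "permanent": {(1, 0): b"old"},
    "commit_list": {(5, 1, 0): b"new", (6, 1, 0): b"other"},
    "committed_tid": 5,
    "crash_before_cleanup": True,
}
try:
    apply_committed(disk)        # crashes after updating the permanent contents
except RuntimeError:
    pass
apply_committed(disk)            # restart: the same modifications are applied again
```

The second call rewrites the same page with the same contents and then completes the cleanup, leaving the entries of the unrelated transaction 6 in place.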

Besides committing, the other possible outcome for a transaction is that it aborts. A transaction
(tid) can be aborted at any time, prior to being committed with APSCommit. The Atomic Page Set
Subsystem is notified of transaction aborts by means of the APSAbort primitive. When a transaction is
aborted, all of the modifications made within it, or any of its nested subtransactions (except already
committed compound subtransactions), must be removed (undone). This is accomplished by
removing all entries for transaction tid, and all of its subtransactions, from both the Modification List
and the Commit List. Note that if entries are to be removed from the Commit List, it must be done
atomically, using the standard technique outlined above.
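
A minimal sketch of abort handling follows, assuming dict-based stand-ins for the two lists and a hypothetical parent_of mapping that records the transaction hierarchy. Already committed compound subtransactions need no special treatment here, since their modifications are no longer in either list; the atomic update of the Commit List is not modeled.

```python
def descendants(tid: int, parent_of: dict) -> set:
    """Return tid plus every transaction whose ancestor chain reaches tid."""
    result = {tid}
    changed = True
    while changed:
        changed = False
        for child, parent in parent_of.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def aps_abort(tid: int, parent_of: dict, mod_list: dict, commit_list: dict) -> None:
    """Remove every entry belonging to tid or one of its subtransactions."""
    doomed = descendants(tid, parent_of)
    for entries in (mod_list, commit_list):
        for key in [k for k in entries if k[0] in doomed]:
            del entries[key]


parent_of = {2: 1, 3: 2, 4: None}          # 3 is nested in 2, which is nested in 1
mod_list = {(1, 9, 0): b"a", (3, 9, 1): b"b", (4, 9, 0): b"c"}
commit_list = {(2, 9, 2): b"d", (4, 9, 5): b"e"}
aps_abort(2, parent_of, mod_list, commit_list)
```

Aborting transaction 2 removes its own entries and those of its subtransaction 3, while the entries of its parent (1) and of the unrelated transaction 4 survive.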

3.7.3  Logical  Disk  Subsystem 

Each  physical  disk  that  is  being  used  by  the  ArchOS  system  has  a  corresponding  Logical  Disk 
Subsystem  Kernel  Arobject.  These  Arobjects  are  created  and  destroyed  as  disks  are  mounted  and 
dismounted  (see  Restart/Reconfiguration  Subsystem,  Section  3.7.4).  A  physical  disk  can  have  at 
most  one  ArchOS  partition  on  it,  and  only  that  portion  of  the  disk  will  be  used  by  ArchOS.  We  refer  to 
the  ArchOS  partition  of  a  disk  as  a  logical  disk.  A  logical  disk  can  simply  be  viewed  as  a  sequence  of 
pages,  where  a  page  is  (currently)  defined  to  be  2K  bytes  in  size.  Pages  on  the  disk  are  numbered 
from 0 to N-1, where N is the total number of pages on the logical disk. A logical disk has a unique
identifier  permanently  associated  with  it,  regardless  of  where  the  disk  is  currently  mounted. 

Each  instance  of  the  Logical  Disk  Subsystem  is  responsible  for  managing  the  allocation  of  pages  on 
its  corresponding  logical  disk.  It  also  keeps  track  of  the  bad  (unusable)  pages  on  the  disk,  so  that 
they  are  not  made  available  for  allocation  and  later  use.  As  much  as  possible,  the  Logical  Disk 
Subsystem "optimizes" access to the physical disk, by allocating, reading, and writing multiple pages
at a time, and ordering operations so as to minimize disk head movement. It also provides a degree of
"network transparency", by automatically buffering and transferring data between the local and
remote  machines. 

In  addition  to  NORMAL  (single)  pages,  the  Logical  Disk  Subsystem  also  supports  the  concept  of 
DUAL  pages.  A  DUAL  page  is  written  more  carefully  than  a  NORMAL  page,  so  that  if  a  crash  occurs 
while  writing  a  DUAL  page,  its  original  contents  can  still  be  recovered.  In  this  way,  a  simple,  basic 
form  of  failure  atomicity  is  provided.  As  the  name  suggests,  each  DUAL  page  is  implemented  using 
two  single  pages.  DUAL  pages  are  used  for  saving  important  data  structures,  and  as  a  building  block 
for  implementing  more  complex  atomic  operations. 
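
The report defers the details of DUAL page handling to [Sturgis 80], so the following is only one plausible way to obtain the stated property, not a description of the ArchOS implementation. It assumes each copy carries a CRC-32 so that a torn write can be detected, writes the two copies one after the other, and repairs whichever copy is out of date on recovery. All function names are invented for the sketch, and the disk is modeled as a dict from page number to bytes.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a CRC-32 so a torn write can be detected (an assumed mechanism)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_valid(block: bytes) -> bool:
    return len(block) >= 4 and zlib.crc32(block[:-4]).to_bytes(4, "big") == block[-4:]


def write_dual(disk: dict, primary: int, secondary: int, payload: bytes,
               crash_after_first: bool = False) -> None:
    disk[primary] = seal(payload)            # first copy
    if crash_after_first:
        return                               # simulate a crash between the two writes
    disk[secondary] = seal(payload)          # second copy


def recover_dual(disk: dict, primary: int, secondary: int) -> bytes:
    """Make both copies consistent and return the surviving payload."""
    if is_valid(disk[primary]):
        disk[secondary] = disk[primary]
    else:
        disk[primary] = disk[secondary]
    return disk[primary][:-4]


disk = {}
write_dual(disk, 10, 12, b"old contents")

# Crash while writing the primary copy: model it as a block with one flipped bit.
torn = seal(b"new contents")
disk[10] = bytes([torn[0] ^ 1]) + torn[1:]
after_torn_write = recover_dual(disk, 10, 12)

# Crash after the primary copy was written completely: the new contents win.
write_dual(disk, 10, 12, b"new contents", crash_after_first=True)
after_partial_write = recover_dual(disk, 10, 12)
```

In both failure cases the page ends up holding either the complete old contents or the complete new contents, which is the basic form of failure atomicity the text describes.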

The  Logical  Disk  Subsystem  provides  the  following  set  of  primitives: 


pagelist = DKAllocate(npages, type [, follows])
val = DKFree(pagelist)

npr = DKRead(pagelist, buffer)
npw = DKWrite(pagelist, buffer)

val = DKSync()

val = DKStatus(statusbuffer)

val = DKRestart(dev-name)
val = DKRecoverDual(pagelist)

val = DKUnmark()
val = DKMark(pagelist)
val = DKGood()
val = DKBad(pagelist)


PAGELIST pagelist             List of 32-bit logical disk page numbers, where the first 8 bits of a
                              page number are used for various flags, and the remaining 24 bits are
                              converted to the cylinder, track, and sector numbers of the physical
                              disk. The first flag bit indicates whether the page is DUAL or NORMAL.

BOOLEAN val                   TRUE if the specified operation is done successfully; otherwise FALSE.

INT npr                       The actual number of pages read.

INT npw                       The actual number of pages written.

INT npages                    The number of pages to be allocated.

PAGETYPE type                 The type of pages to be allocated: NORMAL or DUAL.

PAGEID follows                Logical disk page number, following which the new pages are to be
                              allocated (physically as close as possible).

BUFFER *buffer                The buffer address in the kernel address space of either the local or
                              remote node.

DKSTATUSBUFFER *statusbuffer  The buffer address for returning status information (can be either a
                              local or remote kernel address).

DEVNAME dev-name              The logical device name.



The  DKAllocate  primitive  allows  a  number  of  pages  to  be  allocated  at  a  time,  and  it  returns  a  list  of 
the  allocated  logical  disk  page  numbers.  An  attempt  is  made  to  allocate  pages  which  are  physically 
close  together  on  disk.  The  pages  can  be  of  type  NORMAL  or  DUAL,  and  a  preferred  location  on  disk 
(near  an  existing  page)  can  be  specified.  The  DKFree  primitive  deallocates  the  specified  list  of  pages. 

DKRead  reads  a  list  of  pages  into  the  specified  kernel  buffer,  and  returns  the  count  of  pages 
actually  read.  Similarly,  DKWrite  writes  a  list  of  pages  from  the  specified  kernel  buffer,  and  returns  the 
count  of  pages  actually  written.  When  the  buffer  is  in  a  remote  kernel,  DKRead  and  DKWrite 
automatically  handle  the  cross  node  copying  of  data. 

The  DKSync  primitive  ensures  that  any  buffered  information  within  the  Logical  Disk  Subsystem  is 
consistent  (synchronized)  with  the  version  on  disk.  The  DKStatus  primitive  provides  information 
about  disk  utilization,  frequency  of  access,  frequency  of  errors,  number  of  bad  blocks,  and  so  on.  It 
can  be  used,  for  example,  by  the  Arobject/Process  Management  Subsystem,  to  determine  the 
placement  of  process  address  space  paging  areas. 

DKRestart is used only when the system is restarted. It initializes the disk hardware and any internal
data structures. The DKRecoverDual primitive recovers the given list of DUAL pages, by ensuring that
the two copies of each DUAL page are consistent (see [Sturgis 80] for details). This is used by
various subsystems when recovering their data structures after a crash.

The  DKUnmark  and  DKMark  primitives  are  provided  to  help  in  garbage  collection.  First,  DKUnmark 
is  used  to  flag  all  of  the  pages  on  the  disk  (except  bad  pages)  as  free.  Then,  DKMark  is  used 
(repeatedly)  to  flag  the  specified  pages  as  allocated.  Similarly,  DKGood  and  DKBad  are  used  for 
managing  the  bad  pages  (unusable  pages)  on  the  disk.  DKGood  flags  all  of  the  pages  on  the  disk  as 
good.  Then,  DKBad  is  used  to  flag  pages  that  have  been  determined  to  be  bad. 
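
The intended calling pattern for these four primitives can be sketched with a toy allocation map. The class below is an illustration only: its method names mirror the DK primitives, but the set-based representation and the free_pages helper are assumptions made for the example.

```python
class LogicalDiskMap:
    """Toy page allocation map; the method names mirror the DK primitives."""

    def __init__(self, npages: int):
        self.npages = npages
        self.allocated: set[int] = set()
        self.bad: set[int] = set()

    def dk_unmark(self) -> None:
        self.allocated.clear()               # every page (except bad ones) becomes free

    def dk_mark(self, pagelist) -> None:
        self.allocated.update(pagelist)      # flag the given pages as allocated

    def dk_good(self) -> None:
        self.bad.clear()                     # every page is considered good again

    def dk_bad(self, pagelist) -> None:
        self.bad.update(pagelist)            # flag pages found to be bad

    def free_pages(self) -> list[int]:
        return [p for p in range(self.npages)
                if p not in self.allocated and p not in self.bad]


disk_map = LogicalDiskMap(8)
disk_map.dk_bad([3])
disk_map.dk_mark([0, 1, 2, 5])               # stale allocation state, e.g. before a crash

# Garbage collection: unmark everything, then re-mark only pages still in use.
disk_map.dk_unmark()
for pages_in_use in ([0], [1, 2]):           # e.g. B-tree nodes, then data pages
    disk_map.dk_mark(pages_in_use)
free_after_gc = disk_map.free_pages()
```

Page 5 was allocated before the collection but is reachable from nothing afterwards, so it is reclaimed; page 3 stays out of circulation because it is bad.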

3.7.3.1 Components of the Logical Disk Subsystem

The Logical Disk Subsystem consists of a single Manager process, and three main data structures,
which are shown in Figure 3-27. The Logical Disk Manager (LDM) process accepts all operations for
the subsystem, executes them, and returns the results. The three data structures it uses are:
(1) Sorted Page List, (2) Data Buffer, and (3) Page Allocation Map Buffer.

Figure 3-27: Components of the Logical Disk Subsystem
(Diagram not reproduced. Legible labels: Physical Disk with Pages i, j, and k; Sorted Page List with
entries of the form (pageID, npages, buffer); Data Buffer; Page Allocation Map Buffer.)

The purpose of the Sorted Page List is to order the disk pages to be read or written, so as to reduce
disk head movement, and improve disk performance. Each entry of the Sorted Page List consists of
three values: the logical page ID of the starting page, the number of pages (npages) to be read or
written in one disk operation, and the location within the kernel buffer to be used for the operation.
The Logical Disk Subsystem allows multiple page read and write operations. The pagelist provided is
clustered (if possible) and then entered in the Sorted Page List.
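
One way to picture the clustering step is the sketch below. It assumes that the i-th page of a pagelist corresponds to the i-th page-sized slot of the kernel buffer, so only pages that are adjacent both in the list and on disk are merged into a single entry; the flags byte of the page numbers is ignored. The function name and the use of byte offsets for the buffer location are assumptions made for the example.

```python
PAGE_SIZE = 2048  # page size given in the report


def build_sorted_page_list(pagelist):
    """Cluster a pagelist into (start_page, npages, buffer_offset) entries, sorted by page."""
    runs = []
    for index, page in enumerate(pagelist):
        if runs and page == runs[-1][0] + runs[-1][1]:
            start, npages, offset = runs[-1]
            runs[-1] = (start, npages + 1, offset)        # extend the current run
        else:
            runs.append((page, 1, index * PAGE_SIZE))     # start a new run
    return sorted(runs)                                   # order by starting page


entries = build_sorted_page_list([40, 41, 42, 7, 8, 90])
```

Six page requests collapse into three disk operations, issued in ascending page order so that the head sweeps across the disk in one direction.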

The Data Buffer provided is used only for remote operations (invoked from other nodes in the
distributed  system).  For  example,  data  read  from  the  disk  (for  a  remote  operation)  is  first  placed  in  the 
Data  Buffer,  and  then  transferred  to  the  destination  buffer  on  the  remote  node,  by  using  the  system 
copy  facility.  A  "double  buffering"  technique  is  used,  so  that  the  disk  transfer  and  copy  operations 
can  proceed  in  parallel. 

The Page Allocation Map Buffer is used to buffer some of the pages of the Page Allocation Map.
Whenever the Page Allocation Map has to be read or modified (such as in DKAllocate, DKFree,
DKMark, and DKUnmark), the relevant pages of the map are first read into the buffer (unless they are
already buffered). A detailed description of the Page Allocation Map and its handling is provided in
Section 3.7.3.3.




3.7.3.2 Disk Layout

A logical disk is laid out as a sequence of pages, which are numbered from 0 to N-1, where N is the
total number of pages on the logical disk. The logical page number specified in the logical disk
operations consists of the concatenation of a flags byte with a number from 0 to N-1. A page can
be either NORMAL or DUAL. A NORMAL page is a single disk page, referred to by a logical page
number. A DUAL page consists of two pages, where the second page is located at a fixed offset from
the first (primary) page. The DUAL page is referred to by the logical page number of the primary
page. The first bit in the flags byte (of the logical page number) indicates whether a page is NORMAL
or DUAL. The two physical pages of a DUAL page have a gap of one page between them (the pages are
not adjacent), to reduce the probability of both pages being damaged by a single disk fault, while
still ensuring reasonably efficient sequential access.
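The encoding of a logical page number can be sketched as follows. The report does not give the word width or the bit position of the DUAL flag, so the 32-bit layout (an 8-bit flags byte above a 24-bit page number), the choice of the most significant bit as the "first bit", and the function names are assumptions made for illustration. The offset of 2 between the two copies follows from the stated gap of one page.

```python
PAGE_BITS = 24                      # assumed: 24-bit page number below an 8-bit flags byte
PAGE_MASK = (1 << PAGE_BITS) - 1
DUAL_FLAG = 0x80                    # assumed: "first bit" means the most significant flag bit
DUAL_OFFSET = 2                     # a gap of one page between primary and secondary copy

def make_logical_page(page: int, dual: bool = False) -> int:
    """Concatenate the flags byte with the page number."""
    flags = DUAL_FLAG if dual else 0
    return (flags << PAGE_BITS) | (page & PAGE_MASK)

def physical_pages(logical_page: int) -> list:
    """Return the disk page(s) that a logical page number refers to."""
    flags = logical_page >> PAGE_BITS
    page = logical_page & PAGE_MASK
    if flags & DUAL_FLAG:
        return [page, page + DUAL_OFFSET]
    return [page]
```

Under these assumptions, a DUAL reference to page 0 resolves to disk pages 0 and 2, while a NORMAL reference to page 17 resolves to page 17 alone.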

Logical page 0 of the disk serves a special function. It contains information about the logical disk,
and pointers to all important data structures saved on the disk. Thus, it holds information about the
disk drive characteristics, the logical disk name, the logical disk size, and pointers to the Page
Allocation Map, the Bad Page Table, the File System Directory, and so on. For reliability, page 0 is a
DUAL page.

3.7.3.3 Disk Page Allocation

The Page Allocation Map is a bit map, which has one bit corresponding to each page on the logical
disk. If a disk page has been allocated, its bit in the allocation map is set. The Page Allocation Map
itself is saved near the middle of the disk, to provide efficient access. All pages of the allocation map
are DUAL, and laid out as shown in Figure 3-28.^49


[Figure 3-28 shows consecutive disk pages in the following order:
Allocation Map Page 0 (map for pages 0 to 16K-1);
Allocation Map Page 1 (map for pages 16K to 32K-1);
DUAL of Allocation Map Page 0;
DUAL of Allocation Map Page 1;
Allocation Map Page 2 (map for pages 32K to 48K-1);
Allocation Map Page 3 (map for pages 48K to 64K-1).]

Figure 3-28: Layout of the Page Allocation Map
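The arithmetic behind this layout can be checked with a small sketch. It assumes 2K-byte pages, so that one map page holds 16K bits, as in the figure; the function names are hypothetical.

```python
PAGE_SIZE_BYTES = 2048                    # 2K pages, as in the report's example
BITS_PER_MAP_PAGE = PAGE_SIZE_BYTES * 8   # 16K bits: one map page covers 16K disk pages

def map_location(page: int):
    """Return (allocation map page index, bit index within that map page)."""
    return divmod(page, BITS_PER_MAP_PAGE)

def map_pages_needed(total_pages: int) -> int:
    """Number of DUAL allocation map pages needed for a disk of total_pages pages."""
    return -(-total_pages // BITS_PER_MAP_PAGE)  # ceiling division
```

With these figures, the bit for disk page 16K is the first bit of Allocation Map Page 1, and a disk of 250K pages needs 16 DUAL map pages, that is, 32 disk pages.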


49

If we have a 500 MB logical disk, it would contain 250K pages of size 2K. The size of the allocation
map would be 250K bits, or 16 DUAL pages (32 logical disk pages) approximately.




The page allocation algorithm used is fairly simple. It attempts to pack the allocated pages, starting
at page 1 of the disk, and continuing through to page N-1. If the last page allocated was page i, the
following allocation will start from page i + 1 (unless a preferred allocation page is specified). The
advantage of this scheme is that allocation and deallocation are very efficient. For allocation, the
page of the allocation bitmap containing bit i + 1 is likely to be buffered already, since that page was
probably used recently (for the previous allocation operation). Hence, no disk reads will usually be
required for allocation. For deallocation, the appropriate allocation map page can be easily
determined, then read in (if necessary), and modified. Note that with this allocation scheme,
successively allocated pages will tend to be located close together on disk, thus improving access
efficiency.^50
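The allocation policy described above can be sketched as a next-fit scan over the bit map. This is only an illustration: the class and method names are hypothetical, the bit map is held entirely in memory rather than buffered page by page, and the wrap-around from page N-1 back to page 1 is an assumption, since the report does not say what happens when the end of the disk is reached.

```python
class PageAllocator:
    """Next-fit allocation over an in-memory allocation bit map (illustrative only)."""

    def __init__(self, total_pages: int):
        self.allocated = [False] * total_pages
        self.allocated[0] = True      # page 0 holds the logical disk information
        self.last = 0                 # last page allocated

    def allocate(self, preferred=None):
        """Allocate one page, starting the scan at the preferred page (>= 1) if
        given, otherwise at the page after the last allocation."""
        n = len(self.allocated)
        start = preferred if preferred is not None else self.last + 1
        for step in range(n - 1):
            # Scan pages 1..N-1 cyclically (the wrap-around is an assumption).
            page = 1 + (start - 1 + step) % (n - 1)
            if not self.allocated[page]:
                self.allocated[page] = True
                self.last = page
                return page
        return None                   # no free page left

    def free(self, page: int) -> None:
        """Deallocate a page by clearing its bit."""
        self.allocated[page] = False
```

Note that a freed page is not reused immediately: after pages 1, 2 and 3 are allocated and page 2 is freed, the next allocation returns page 4, which is why the disk gradually fragments with continued use.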

3.7.3.4 Bad Page Handling

One of the functions of logical disk management is to prevent the use of bad pages on the disk. A
list, called the Bad Page List, is maintained near the end of the disk, and consists of the logical page
numbers of all known, unusable (bad) pages.^51 The Bad Page List is stored as DUAL pages. When
the Page Allocation Map is initialized (using DKUnmark), all the bad pages are marked as allocated, to
prevent further usage of these pages.

The  Bad  Page  List  is  usually  constructed  when  a  new  disk  is  initialized,  and  is  not  expected  to 
change  very  much  after  that.  When  the  Bad  Page  List  is  constructed,  the  Data  Buffer  can  be  used  for 
buffering  it.  The  List  will  be  constructed  by  a  special  utility,  which  first  calls  DKGood  to  remove  the 
old  Bad  Page  List  (if  it  exists).  After  that,  the  entire  disk  is  scanned  to  find  any  bad  pages.  This  is 
done  by  writing  each  page  and  then  reading  it  back.  For  each  bad  page  found,  its  logical  page  ID  is 
entered  in  the  Bad  Page  List  (with  DKBad).  Once  the  system  is  in  operation,  if  a  bad  page  is 
discovered  by  some  subsystem,  it  can  call  DKBad  to  enter  the  page  in  the  Bad  Page  List. 


50

This allocation scheme will be able to cluster multiple page allocations reasonably well when the disk is new, but the disk
will slowly become more and more fragmented with continued use. Some improvements could be made, to reduce the amount
of fragmentation, if it is found to have a significant impact on performance.

51

The entries in the Bad Page List are sorted when the list is first constructed, but new entries are simply added to the end of
the list as found. The Bad Page List is expected to require only a few disk pages for storage. If 1% of the pages (size 2KB
each) of a 500 MB disk are bad, the Bad Block List would have 2.5K entries. Given that each entry is 4 bytes long, 5 DUAL
pages (10 disk pages) are required to save the entire list.




3.7.4  Restart/Reconfiguration  Subsystem 

The Page Set Restart/Reconfiguration Subsystem (PSR Subsystem) performs several functions
related to the restart of a computation node following a crash, as well as the mounting and
unmounting of logical disks. A single instance of this subsystem resides at each node in the system.
The PSR Subsystem "bootstraps" the entire Page Set Subsystem at restart time. After a crash, the
Restart/Reconfiguration Subsystem is restarted automatically, and is responsible for constructing the
entire Page Set Subsystem. The kernel (on a node) maintains a list of device addresses which it
probes, to determine the logical disks on its own computing node. The PSR Subsystem sets up the
atomic and standard page set arobjects, as well as the logical device arobject, for each logical disk
found on the node.

In  addition  to  restarting  the  Page  Set  Subsystem,  the  PSR  Subsystem  adds  and  removes  logical 
disks  dynamically  (while  the  system  is  running)  as  necessary.  When  a  new  logical  disk  is  added,  the 
various  Page  Set  Subsystem  arobjects  (Standard  Page  Set,  Atomic  Page  Set,  and  Logical  Disk)  have 
to  be  created  for  it.  When  a  disk  is  removed,  the  corresponding  arobjects  have  to  be  killed.  Some 
support  is  provided  for  the  clean  unmounting  of  disks.  It  is  possible  to  quieten  activity  on  a  disk  (by 
allowing  old  activity  to  complete,  but  refusing  new  activity),  prior  to  disk  removal. 

The  PSR  Subsystem  maintains  a  global  Mount  Table,  which  indicates  the  location  (node)  of  all  the 
logical  disks  in  the  entire  distributed  system,  and  the  Page  Set  Subsystem  Arobjects  associated  with 
each  logical  disk.  Each  instance  of  the  PSR  Subsystem  maintains  a  copy  of  the  Mount  Table,  and  the 
peers  co-operate  in  keeping  up-to-date  versions  of  the  table.  The  Mount  Table  at  a  node  is  globally 
accessible  to  other  kernel  level  subsystems  on  that  node.  Along  with  the  global  Mount  Table,  a  local 
Unmount  Table  is  also  maintained.  The  Unmount  Table  lists  the  logical  disks  accessible  to  the  PSR 
Subsystem,  but  not  globally  visible  to  other  subsystems  on  its  node,  or  to  other  nodes  in  the 
distributed  system. 

The PSR Subsystem performs two other useful functions. Firstly, it co-ordinates garbage collection
for the entire Page Set Subsystem (on a node). Secondly, it checks any given disk for bad (unusable)
pages, and enters them in a Bad Page List (refer to Section 3.7.3). Each of these two functions can be
performed at restart, as well as when the system is running.

The PSR Subsystem allows a logical disk to be in one of four possible states: Mounted, Unmounted,
Quiet, or Removed (refer to Figure 3-29). In the Mounted state, the logical disk has an entry in the Mount
Table, and is globally visible on its own node, as well as throughout the system. All types of disk
operations (except disk checking and garbage collection), such as paging, reading and writing, are
permitted in this state. The Quiet state is very similar to the Mounted state in having an entry in the
Mount Table, and being globally visible. The only difference is that this state represents an attempt to
quiet down system activity prior to disk removal, and hence new activity is not allowed. In the
Unmounted state, the logical disk has an entry only in the Unmount Table, and is visible only within the
PSR Subsystem. The main purpose of this state is to allow disk checking, garbage collection, and
other disk maintenance to be performed, while the rest of the system is still up and running. The
Removed state corresponds to the physical removal of the disk from the computer system.

[Figure 3-29 is a state diagram whose transitions are labelled with PSR primitives, including
PSRMount and PSRInitDevice.]

Figure 3-29: State Diagram for Logical Disks


val = PSRRestart(device-list [, options])

diskid = PSRInitDevice(dev-name [, initflags])

val = PSRRemoveDisk(diskid)

val = PSRMount(diskid)

val = PSRUnmount(diskid)

val = PSRQuiet(diskid)

val = PSRCheckDisk(diskid)
val = PSRGarbageCollect(diskid)

val = PSRInsertMount(diskid, node, disk-aid, ps-aid, aps-aid, mtflags)

val = PSRRemoveMount(diskid)

val = PSRRequestMount(mount-table)


BOOLEAN val
        TRUE if the specified operation is done successfully; otherwise FALSE.

DISKID diskid
        The Logical Disk ID of the disk-volume being operated on.

DEVLIST device-list
        The list of logical device names corresponding to logical disks in the system.

PSROPTIONS options
        A set of boolean flags which can be specified as restart options for operations to
        be carried out in the restart sequence. The flags provided are: GARBAGE-COLLECT
        and CHECKDISK.

DEVNAME dev-name
        The logical device name.

PSRINITFLAGS initflags
        A set of boolean flags which can be specified when a logical disk is initialized.
        Currently, two flags are provided: READ-ONLY and SELF-CONTAINED.

NODENAME node
        The ID of the node on which the disk is mounted.

AID disk-aid
        The arobject ID for the Logical Disk Subsystem Arobject corresponding to the
        disk diskid.

AID ps-aid
        The arobject ID for the Standard Page Set Subsystem Arobject corresponding to
        the disk diskid.

AID aps-aid
        The arobject ID for the Atomic Page Set Subsystem Arobject corresponding to
        the disk diskid.

PSRMOUNTFLAGS mtflags
        A set of boolean flags associated with each entry in the Mount Table. The flags
        provided are: QUIET, READ-ONLY, and SELF-CONTAINED.

MOUNTBUFFER *mount-table
        The buffer used for transferring the contents of the mount table between different
        nodes in the system.


The PSRRestart primitive restarts the Page Set Subsystem on a given node, following a node crash.
It is invoked by a kernel process responsible for bringing up the entire kernel on that node. The list of
devices corresponding to all logical disks on the node is determined by the kernel process, and
supplied as a PSRRestart parameter. For each logical disk on the node, the PSRRestart primitive
creates and restarts three page set subsystem arobjects (standard page set, atomic page set, and
logical disk). In addition, if the GARBAGE-COLLECT and/or CHECKDISK options are specified, then
garbage collection and/or disk checking (for bad pages) are performed as part of the page set restart
sequence.




The PSRInitDevice primitive adds a new logical disk to the node. The state of the logical disk makes
a transition from the Removed state to the Unmounted state. The system device name (dev-name) for
the logical disk is supplied, and its logical ID is found (from page 0 of the disk) and returned. The
Standard Page Set, Atomic Page Set, and Logical Disk arobjects are created for the newly added
disk. Optional flags (READ-ONLY and SELF-CONTAINED) can be specified as necessary, for
restricting the type of disk access permitted. The READ-ONLY flag specifies that the entire disk is
read-only, and its contents cannot be modified. The SELF-CONTAINED flag specifies that all files
saved on that disk must have their directory entries, as well as their file contents (page sets), saved on
that disk.^52 The PSRRemoveDisk primitive removes a specified unmounted logical disk from the
system. The three page set arobjects for the logical disk are destroyed.

The PSRMount primitive mounts the specified unmounted logical disk, and makes it globally
accessible. All nodes in the distributed system are informed about the newly mounted disk. The
PSRUnmount primitive unmounts the specified logical disk, from the Mounted or Quiet state. In either
case, the result of a PSRUnmount is seen by the rest of the system as a disk crash, where no further
operations are permitted, and ongoing operations cannot be completed.^53 It is expected, however,
that if the disk was quiet prior to unmounting, there would be very little (if any) ongoing activity at this
time. The PSRQuiet primitive flags a mounted logical disk as being in a special quiet state, in which
ongoing activity can be continued, but new activity is not allowed. The main purpose of this primitive
is to allow for clean disk unmounting.
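The state transitions described in the preceding paragraphs can be summarized as a small table-driven state machine. The sketch below is illustrative only; the Python names are hypothetical, and the table contains just the transitions stated in the text.

```python
from enum import Enum

class DiskState(Enum):
    REMOVED = "Removed"
    UNMOUNTED = "Unmounted"
    MOUNTED = "Mounted"
    QUIET = "Quiet"

# (primitive, current state) -> next state, as described in the text.
TRANSITIONS = {
    ("PSRInitDevice", DiskState.REMOVED): DiskState.UNMOUNTED,
    ("PSRRemoveDisk", DiskState.UNMOUNTED): DiskState.REMOVED,
    ("PSRMount", DiskState.UNMOUNTED): DiskState.MOUNTED,
    ("PSRQuiet", DiskState.MOUNTED): DiskState.QUIET,
    ("PSRUnmount", DiskState.MOUNTED): DiskState.UNMOUNTED,
    ("PSRUnmount", DiskState.QUIET): DiskState.UNMOUNTED,
}

def apply(primitive: str, state: DiskState) -> DiskState:
    """Return the next state, or raise if the primitive is not permitted in this state."""
    try:
        return TRANSITIONS[(primitive, state)]
    except KeyError:
        raise ValueError(f"{primitive} is not permitted in state {state.value}") from None
```

A clean removal would then follow the path Mounted, Quiet, Unmounted, Removed, using PSRQuiet, PSRUnmount and PSRRemoveDisk in that order.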

The PSRCheckDisk and PSRGarbageCollect primitives are allowed only when the specified logical
disk is in the Unmounted state. The PSRCheckDisk primitive determines the bad pages on the
specified disk (non-destructively), and constructs the Bad Page List. The PSRGarbageCollect
primitive co-ordinates garbage collection for the specified disk, by invoking garbage collection
primitives provided by the standard page set, atomic page set, and logical disk arobjects.

The Mount Table is replicated on all nodes in the distributed system. Three primitives are provided,
which are used only by peer Restart/Reconfiguration arobjects on other nodes, for obtaining and
updating information from the Mount Table. The PSRInsertMount primitive is used for adding a new
Mount Table entry, or for modifying an existing entry. The parameters specified are the logical disk ID
(diskid), the ID of the node on which the disk is mounted (node), the IDs of the three Page Set
Subsystem arobjects: logical disk subsystem (disk-aid), standard page set (ps-aid), and atomic page
set (aps-aid), and a set of flags. The PSRRemoveMount primitive removes the Mount Table entry for
the logical disk. Finally, the PSRRequestMount primitive returns the contents of the entire Mount
Table in the buffer provided. This primitive is primarily invoked by a peer PSR arobject which is
restarting after a crash, and attempting to reconstruct its copy of the system-wide Mount Table.

52

If a disk is self-contained, it can easily be moved from one ArchOS system and initialized on another.

53

PSRUnmount can be implemented as the destruction of the three page set arobjects (standard page set, atomic page set,
and logical disk), followed by the creation of new page set arobjects. In this case, all requests for operations with the old page
set arobject IDs will return error indications.

3.7.4.1 Components of the Restart/Reconfiguration Subsystem

The PSR Subsystem consists of four types of components: (1) the Mount Table, (2) the Unmount
Table, (3) the PSR Manager process, and (4) one or more PSR Worker processes. Each of these
components is briefly described below, and shown in Figure 3-30.


[Figure 3-30 shows the two tables and their columns.
Mount Table: Logical Disk ID | Node ID | Disk AID | PS AID | APS AID | FLAGS
Unmount Table: Logical Disk ID | Disk AID | PS AID | APS AID | FLAGS]

Figure 3-30: Components of the Restart/Reconfiguration Subsystem


The Mount Table provides information about the location of all (Mounted and Quiet) logical disks in
the entire distributed system, and the arobjects which are responsible for managing these disks. A
copy of the Mount Table is maintained at each node in the system. The Mount Table is globally visible
to all subsystems on a node (and to other nodes), but is updated only by the PSR Subsystem. During
updates, the table is locked, to prevent access to inconsistent states of the table. Each entry of
the Mount Table contains the following items of information: the ID of the logical (mounted or quiet)
disk, the ID of the node on which it is mounted, the ID of the Logical Disk Subsystem arobject for that
disk, the ID of the Standard Page Set arobject for that disk, the ID of the Atomic Page Set arobject for
that disk, and a set of flags (QUIET, READ-ONLY, and SELF-CONTAINED). The QUIET flag is the only
way of specifying whether a logical disk is in the mounted state or in the quiet state. The READ-ONLY
and SELF-CONTAINED flags specify particular types of disk usage, and have been discussed in
the previous section.

The  Unmount  Table  provides  information  about  unmounted  logical  disks,  which  can  be  accessed  by 
the  PSR  Subsystem,  but  are  not  visible  to  the  rest  of  the  system.  It  is  useful  to  be  able  to  access  disks 
in  this  mode,  primarily  for  maintenance  purposes,  such  as  disk  checking  and  garbage  collection.  The 
entries  in  the  Unmount  Table  are  very  similar  to  those  in  the  Mount  Table,  except  that  the  ID  of  the 
node  is  not  required,  since  the  Unmount  Table  refers  only  to  its  local  node.  In  the  flags  field,  the 
QUIET  flag  is  not  required  for  this  table. 
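The two kinds of table entry can be sketched as simple record types. The field names follow the parameters of PSRInsertMount, but the Python types, the flag values, and the helper function are assumptions made for illustration; the report does not specify a representation.

```python
from dataclasses import dataclass
from enum import Flag

class MountFlags(Flag):
    NONE = 0
    QUIET = 1             # only meaningful in the Mount Table
    READ_ONLY = 2
    SELF_CONTAINED = 4

@dataclass
class UnmountEntry:
    """Entry for a disk visible only to the local PSR Subsystem (no node ID needed)."""
    diskid: int
    disk_aid: int
    ps_aid: int
    aps_aid: int
    flags: MountFlags = MountFlags.NONE

@dataclass
class MountEntry:
    """Entry for a globally visible (Mounted or Quiet) disk."""
    diskid: int
    node: str
    disk_aid: int
    ps_aid: int
    aps_aid: int
    flags: MountFlags = MountFlags.NONE

    @property
    def state(self) -> str:
        # The QUIET flag is the only thing distinguishing Quiet from Mounted.
        return "Quiet" if MountFlags.QUIET in self.flags else "Mounted"

def to_mount_entry(entry: UnmountEntry, node: str) -> MountEntry:
    """Build the Mount Table entry for an unmounted disk being mounted on `node`."""
    return MountEntry(entry.diskid, node, entry.disk_aid, entry.ps_aid,
                      entry.aps_aid, entry.flags)
```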

The PSR Manager process is responsible for providing most of the functionality of the PSR
Subsystem. It accepts all PSR operations, and returns the results. It also manages the Mount and the
Unmount tables. The PSR Worker processes serve two functions. They implement PSRCheckDisk
and PSRGarbageCollect operations on behalf of the Manager. Both of these operations can
potentially take a very long time; because Workers carry them out, the PSR Subsystem can still be
used for performing operations on other disks concurrently. The other function of the Worker process
is to wait on behalf of the Manager process, when peer level operations are invoked by the Manager.
This prevents deadlocks arising from multiple, concurrent peer operations.^54 One Worker process is
always present in the PSR Subsystem for invoking peer PSR operations (PSRInsertMount,
PSRRemoveMount, and PSRRequestMount). For each PSRCheckDisk or PSRGarbageCollect operation,
a new Worker process is created, and then destroyed when the operation is completed.

3.7.4.2  Restart 

The PSR Subsystem performs two related functions when restarting after a node crash. The Mount
Table has to be recreated, and the Page Set Subsystem has to be brought up. The reconstruction of
the Mount Table is quite similar to the reconstruction of the Directory Map Table, when the Directory
Map Subsystem of the File System is restarted (refer to Section 3.6.3). The PSR Manager process
invokes a Worker process to obtain the Mount Table (with PSRRequestMount) from one of the other
nodes in the system. In the meantime, it proceeds to initialize all the logical disks on its own node,
which have been found by the kernel restart process. It determines the IDs of the logical disks, and
mounts them.

54

Note the similarity with the handling of the Directory Map Table in the File Subsystem in Section 3.6.3.

For each logical disk on the node, its ID is first determined, and an entry is made in the Unmount
Table. Next, a Logical Disk Subsystem arobject is created for that disk, and then restarted. As a
result of the Disk Subsystem restart, all the logical disk data structures are recovered (page 0 of the
disk, the allocation map, and the bad page list). Once this operation has completed, the standard
page set arobject is created and restarted, so that the Page Set B-tree is recovered correctly. Next,
the atomic page set arobject is created and restarted. This allows all its data structures to be
recovered, and all committed transactions to be completed. The IDs for the three newly created
arobjects are entered in the Unmount Table. Next, the logical disk is mounted, and entered in the
Mount Table. Once all the logical disks are thus mounted, all the peer PSR arobjects are told to
update their Mount Tables, using the PSRInsertMount primitive.
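The ordering of the per-disk restart steps can be made explicit with a short sketch. All of the names below are hypothetical: the three callbacks stand in for reading the disk ID from page 0, for creating and restarting an arobject, and for sending PSRInsertMount to the peers. The sketch also assumes that mounting moves an entry from the Unmount Table to the Mount Table, and it omits the concurrent retrieval of the Mount Table by a Worker process.

```python
# Order matters: the logical disk arobject must be recovered before the
# standard page set, and the standard page set before the atomic page set.
AROBJECT_ORDER = ("logical_disk", "standard_page_set", "atomic_page_set")

def restart_node(device_list, read_disk_id, create_and_restart, notify_peers):
    """Bring up the Page Set Subsystem for every logical disk on this node.

    read_disk_id(dev)             -> logical disk ID (hypothetical callback)
    create_and_restart(kind, id)  -> arobject ID     (hypothetical callback)
    notify_peers(mount_table)     -> None            (hypothetical callback)
    """
    unmount_table, mount_table = {}, {}
    for dev in device_list:
        diskid = read_disk_id(dev)
        unmount_table[diskid] = {}                 # entry made in the Unmount Table
        for kind in AROBJECT_ORDER:
            unmount_table[diskid][kind] = create_and_restart(kind, diskid)
        # Mount the disk: assumed here to move the entry between the tables.
        mount_table[diskid] = unmount_table.pop(diskid)
    notify_peers(mount_table)                      # peers update via PSRInsertMount
    return mount_table
```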

If  the  disk  checking  and  garbage  collection  options  are  specified,  then  these  operations  also  have 
to  be  performed  as  part  of  the  restart  sequence.  Garbage  collection  is  a  good  idea  after  a  restart,  so 
that  all  temporary  page  sets  can  be  flushed. 

3.7.4.3 Garbage Collection and Disk Checking

The PSR Subsystem co-ordinates garbage collection for the Page Set Subsystem. It first calls
the DKUnmark primitive, so that the Logical Disk Subsystem marks the entire allocation map as being
free. Next, it calls the PSGarbageCollect primitive, which specifies all the pages being used by the
Standard Page Set Subsystem. Following that, the APSGarbageCollect primitive is invoked, which
marks all the disk pages in use by the Atomic Page Set Subsystem. As a result of this garbage
collection procedure, all temporary page sets, as well as pages not in use by the Page Set Subsystem,
are freed.^55
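This procedure amounts to rebuilding the allocation map from the pages that are still referenced. A minimal sketch follows; the function and parameter names are hypothetical, the map is modelled as a list of booleans, and pages used by the Logical Disk Subsystem itself (page 0, the allocation map, and the Bad Page List) are ignored for brevity. Bad pages are kept allocated, as Section 3.7.3.4 states for DKUnmark.

```python
def garbage_collect(total_pages, bad_pages, ps_pages, aps_pages):
    """Rebuild the allocation map; True means the page is allocated."""
    # DKUnmark: mark the whole map free, except bad pages, which stay allocated.
    allocated = [page in bad_pages for page in range(total_pages)]
    # PSGarbageCollect, then APSGarbageCollect: re-mark every page still in use.
    for page in list(ps_pages) + list(aps_pages):
        allocated[page] = True
    return allocated
```

Any page that was allocated before the procedure but is named by neither subsystem, such as a page of a temporary page set, ends up free.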

Disk checking determines all the unusable (bad) pages on the specified disk in a non-destructive
manner. First, the DKGood primitive is called to remove the old Bad Page List, thereby marking all
disk pages as "good". Then each page on the disk is read. If there is no checksum error in reading,
then the page is usable. If there is an error, the page is written, and then read again. If the error
persists, then the page number is entered in the Bad Page List. All the pages of the disk are checked
in this way.
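The checking loop can be sketched as follows. The two callbacks are hypothetical stand-ins for the disk driver; a checksum error is modelled as an IOError; and the data written to a suspect page (zeros here) is an assumption, since the report does not say what is written.

```python
def check_disk(total_pages, read_page, write_page, page_size=2048):
    """Return the page numbers to be entered in the Bad Page List.

    read_page(p) raises IOError on a checksum error; write_page(p, data)
    rewrites the page. Both are hypothetical callbacks.
    """
    bad = []                               # DKGood: start with an empty Bad Page List
    for page in range(total_pages):
        try:
            read_page(page)
            continue                       # clean read: the page is usable
        except IOError:
            pass
        try:
            # The page could not be read, so rewriting it loses no readable data.
            write_page(page, bytes(page_size))
            read_page(page)
        except IOError:
            bad.append(page)               # DKBad: the error persists
    return bad
```

Pages that read correctly are never written, which is what makes the check non-destructive.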

55

It is still possible for unreferenced files to exist, since this garbage collection procedure does not consider subsystems
above the Page Set Subsystem.




3.8  I/O  Device  Subsystem 

An I/O device is treated as a special type of file for which no transactions are allowed and a
device-specific control scheme is included. A device must be opened before it can be read, written,
or sent a special command such as reset, and it has to be closed after all of the actions
are completed.

All of the device-dependent commands can be sent to devices by using the SetIoctl primitive.

dd = OpenDevice(devicename, mode)

CloseDevice(dd)

nr = ReadDevice(dd, buf, nbytes)
nw = WriteDevice(dd, buf, nbytes)

event-cnt = Iowait(dev-descriptor, timeout)

val = SetIoctl(dev-descriptor, io-command, dev-buf, timeout)


It should be noted that the current I/O system does not interact with the transaction manager at all.
Thus, no transaction facility of any type is provided.


3.9  Time-Driven  Scheduler  Subsystem 

Process  scheduling  in  a  real-time  facility  is  crucial  to  the  success  of  the  system.  Process 
scheduling  in  ArchOS  differs  fundamentally  from  scheduling  on  other  operating  systems  because  of 
several  critical  factors: 

•  ArchOS  is  a  real-time  system;  hence  process  scheduling  must  be  compatible  with  its 
critical  goal  of  supporting  application-defined  time  constraints 

•  ArchOS  design  manages  application  time-constraints  by  explicitly  accounting  for  the  fact 
that  there  is  a  definable  time-varying  value  to  the  system  for  completing  each  process  at  a 
particular  time. 


3.9.1  Best-Effort  Scheduling 




3.9.1.1 Value Function Processing

The ArchOS scheduler, part of the ArchOS Kernel Arobject (see Section 3.3.1), is designed to use
explicitly the application-defined value functions for the set of processes in its queue in making a
"best-effort" decision on the process to be executed at each point in time.

The computational model for this scheduler consists of a set of schedulable processes $p_i$ resident in
a processor. Each such process has a request time $R_i$, an estimated computation interval $C_i$ and a
value function $V_i(t)$, where $t$ is a time for which the value is to be determined. Figure 3-31 illustrates
these process attributes for a process with a linearly decreasing value function prior to a critical time
$D_i$, and an exponential value decay following the critical time. The illustration depicts a process which
is dispatched after its request time and which completes prior to its critical time without being
preempted.
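A value function of the shape just described (linear decrease up to the critical time, exponential decay after it) can be written down directly. The parameter names and the choice to measure time from the request time are assumptions made for this sketch; the report does not prescribe a parameterization.

```python
import math

def make_value_function(initial_value, slope, critical_time, decay_rate):
    """Return V(t) for a process, with t measured from its request time (assumption).

    Before the critical time the value falls linearly; afterwards it decays
    exponentially from the value reached at the critical time, so V is
    continuous there and only its first derivative jumps.
    """
    value_at_critical = initial_value - slope * critical_time

    def value(t):
        if t <= critical_time:
            return initial_value - slope * t
        return value_at_critical * math.exp(-decay_rate * (t - critical_time))

    return value
```

For instance, with an initial value of 10, a slope of 0.5 and a critical time of 8, the value at the critical time is 6, and it falls away exponentially for any later completion.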


Figure 3-31: Process Model Attributes for Process i

$V_i(t)$ defines the value to the system due to completing $p_i$ at time $t$. The set of these value functions
is used by the scheduler to determine the best sequence in which to schedule each of the available
processes. The type of functions definable for $V_i$ will determine the range of scheduling policies
supportable by the operating system, particularly with respect to the handling of a processor overload
in which some critical times cannot be met.

We note that the existence and importance of the deadline is dependent on the value function. The
value function can be said to define an explicit deadline only if it has a discontinuity in the function, its
first derivative, or its second derivative, in which the value is lower or decreasing after the
discontinuity. In Figure 3-31, the deadline is defined by the discontinuity in its first derivative at the
critical time $D_i$.

At any particular point in time, there will be $n$ processes ready for scheduling, resulting in $n!$
possible scheduling sequences. Each of these sequences consists of a process ordering
$(m_1, \ldots, m_n)$, where $p_{m_1}$ would be the process to be scheduled. A scheduling sequence will be
considered optimal if, with respect to the available information at the time of the scheduling decision,
$\beta$ is maximized, where

$$\beta = \sum_{i=1}^{n} V_i(T_i)$$

and $T_i$ is the actual completion time of $p_i$ using this scheduling sequence (i.e., if $p_j$ is the
process to be scheduled, then $T_j$ is the current time plus $C_j$). See Figure 3-32 for an
example of four processes with value functions, for which the choice of a best-effort schedule is
non-trivial. The figure shows the four value functions, and a potential scheduling sequence such that
each completes with a high value.
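For a small number of processes the optimal sequence can be found by exhaustive search, which makes the definition concrete. The sketch below is not the ArchOS scheduler: it assumes non-preemptive, back-to-back execution, treats the estimated computation intervals as exact, and uses hypothetical names throughout. Its cost grows as the factorial of the number of processes, which is why heuristics are needed in practice.

```python
from itertools import permutations

def best_sequence(processes, now=0.0):
    """Return (ordering, total value) maximizing the sum of V_i(T_i).

    processes maps a process name to (C, V): its estimated computation
    interval and its value function of completion time.
    """
    best_order, best_total = None, float("-inf")
    for order in permutations(processes):
        t, total = now, 0.0
        for name in order:
            c, v = processes[name]
            t += c                # completion time of this process in this ordering
            total += v(t)
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total
```

With two processes of length 2, one worth 10 only if finished by time 2 and the other worth 5 if finished by time 10, the search selects the tightly constrained process first and obtains a total value of 15.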


Figure 3-32: Four "Typical" Processes with their Value Functions


Since  the  completion  times  used  for  this  scheduler  will  be  known  only  stochastically,  the  assumed 
distribution  and  its  computed  parameters  (e.g.,  mean,  variance)  will  be  used  to  compute  its  expected 
value  in  the  scheduling  sequence,  resulting  in  a  statistically  "good"  sequence.  Making  an  optimum 
schedule  can  be  shown  to  be  computationally  intractable,  so  the  best-effort  scheduler  will  use  a  set  of 
heuristics  to  determine  a  sequence  which  will  generate  a  high  total  value  over  any  time  period  which 
is  sufficiently  long  with  respect  to  the  completion  times  of  the  processes  to  be  scheduled. 




3.9.1.2 Well-Known Scheduling Algorithms

Scheduling a set of processes consists of producing a sequence of processes on one or more
processors such that the utilization of resources optimizes some scheduling criterion. Criteria which
have historically been used to generate process schedules include maximizing process flow (i.e.,
minimizing the elapsed time for the entire sequence), or minimizing the maximum lateness (lateness is
defined to be the difference between the time a process is completed and its deadline). It has long
been known [Conway 67] that there are simple algorithms which will optimize certain such criteria
under certain conditions, but the problems of optimizing most of the interesting scheduling criteria are
known to be NP-complete [Garey 79], indicating that there is no known efficient algorithm which can
produce an optimum sequence. Clearly, the choice of metric is crucial to the generation of a
processing sequence which will meet the goals of the system for which the schedule is being
prepared.


There  are  several  well-known  scheduling  algorithms  which  have  traditionally  been  used  in  process 
scheduling, each with properties which make it useful for certain applications. Among these are:

1.  SPT.  At  each  decision  point,  the  process  with  the  shortest  estimated  completion  time  is 
executed.  This  algorithm  maximizes  overall  throughput,  and  is  frequently  used  (although 
in  modified  form)  in  batch  systems. 

2. Deadline. At each decision point, the process with the earliest deadline is executed.
This algorithm, in the absence of an overload, will result in meeting all deadlines. More
precisely, this algorithm will minimize the maximum lateness.

3.  Slack.  At  each  decision  point,  the  process  with  the  smallest  estimated  slack  time 
(elapsed  time  to  the  critical  time  minus  its  estimated  completion  time)  is  executed.  This 
algorithm,  in  the  absence  of  an  overload,  will  also  result  in  meeting  all  deadlines,  but  will 
produce  a  much  higher  level  of  preemptions.  This  algorithm  will  maximize  the  minimum 
lateness. 

4. FIFO. At each decision point, the process which has been in the request set longest is
executed. This algorithm will produce a relatively "fair" schedule, in which lateness will
be spread out to all processes.

5. Priority. At each decision point, the process with the highest fixed priority is executed.
The "most important" process is executing at any moment.
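
The five selection rules above differ only in the key used to pick the next process at a decision point. The following is a minimal Python sketch of that idea, not part of the ArchOS design; the `Proc` record, its field names, and the `pick` helper are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    arrival: float    # time the process entered the request set
    remaining: float  # estimated remaining computation time
    deadline: float   # absolute deadline (critical time)
    priority: int     # fixed priority; larger means more important (assumed convention)

def pick(ready, rule, now):
    """Return the process that the named rule would run at decision time `now`."""
    keys = {
        "SPT": lambda p: p.remaining,                         # shortest estimated completion time
        "Deadline": lambda p: p.deadline,                     # earliest deadline
        "Slack": lambda p: (p.deadline - now) - p.remaining,  # smallest estimated slack
        "FIFO": lambda p: p.arrival,                          # longest in the request set
        "Priority": lambda p: -p.priority,                    # highest fixed priority
    }
    return min(ready, key=keys[rule])

ready = [
    Proc("A", arrival=0.0, remaining=5.0, deadline=20.0, priority=1),
    Proc("B", arrival=1.0, remaining=2.0, deadline=12.0, priority=3),
    Proc("C", arrival=2.0, remaining=9.0, deadline=14.0, priority=2),
]
for rule in ("SPT", "Deadline", "Slack", "FIFO", "Priority"):
    print(rule, pick(ready, rule, now=3.0).name)
```

With this request set, the Slack rule selects C even though B has the earlier deadline, which shows how the same processes can be ordered differently by each rule.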


Of  these,  real-time  systems  traditionally  use  only  the  Priority  and/or  FIFO  schedulers.  It  should  be 
noted  that  no  objective  performance  measures  (e.g.,  meeting  deadlines,  high  throughput,  low 
lateness  or  tardiness)  are  even  approximately  optimized  by  these  algorithms,  but  they  are  inexpensive 
to  implement  and  require  very  few  run-time  resources. 




In  priority-driven  scheduler  systems,  deadline  management  is  attempted  by  assigning  a  high  fixed 
priority to processes with "important" deadlines, disregarding the resulting impact on less
"important" deadlines. During the testing period, these priorities are (usually manually) adjusted until
the system implementer is convinced that the system "works". This approach can work only for
relatively simple systems, since the fixed priorities do not reflect any time-varying value of the
computations with respect to the problem being solved, nor do they reflect the fact that there are many
schedulable sets of process deadlines which cannot be met with fixed priorities. In addition,
implementers of such systems find that it is extremely difficult to determine reasonable priorities,
since, typically, each individual subsystem implementer feels that his or her program is of high
importance  to  the  system.  This  problem  is  usually  "solved"  by  deferring  final  priority  determination  to 
the  system  test  phase  of  implementation,  so  the  resulting  performance  problems  remain  hidden  until  it 
is  too  late  to  consider  the  most  effective  design  solutions. 

Using a deadline scheduler (i.e., a scheduler which schedules the process with the closest deadline
first [Liu 73]) solves the problem of missing otherwise schedulable deadlines due to the imposition of
fixed  priorities,  but  leaves  other  problems,  most  notably  the  problem  of  the  transient  overload.  The 
deadline  scheduler  provides  no  reasonable  control  of  the  choice  of  which  deadlines  must  be  delayed 
in  an  overload,  leading  to  unpredictable  failures  and  resulting  in  an  impact  on  reliability  and 
maintainability. 

3.9.1.3 A Best-Effort Scheduling Algorithm

The creation of a best-effort scheduling algorithm is one of the key research interests of the
Archons project, and work on it is underway at this writing. However, some preliminary results
have been produced, and these will be used, at least in our initial scheduler
design.  In  this  initial  algorithm,  we  take  advantage  of  several  observed  value  function  and  scheduling 
characteristics: 

•  Given  a  set  of  processes  with  deadlines  which  can  all  be  met  based  on  the  sequence  of 
the  deadlines  and  the  computation  times  of  the  processes,  it  can  be  shown  that  a 
schedule  in  which  the  process  with  the  earliest  deadline  is  scheduled  first  (i.e.,  a  Deadline 
schedule)  will  always  result  in  meeting  all  deadlines. 

•  Given  a  set  of  processes  (ignoring  deadlines)  with  known  values  for  completing  them,  it 
can be shown that a schedule in which the process with the highest value density ($V_i/C_i$, in
which $V_i$ is its value and $C_i$ is its processing time) is processed first will produce a total
value  at  any  period  of  time  no  lower  than  any  other  schedule. 

•  Most  value  functions  of  interest  (at  least  among  those  investigated  at  this  time)  have  their 
highest  value  occurring  immediately  prior  to  the  critical  time. 




Some  of  the  implications  of  these  observations  are: 

•  If  no  overload  occurs,  all  deadlines  will  be  met,  and  the  value  function  produced  by  the 
deadline  schedule  will  be  as  high  as  possible. 

•  If  an  overload  occurs,  and  some  processes  must  miss  their  deadlines,  the  Value  Density 
Schedule  would  produce  a  high  value. 

•  Therefore,  if  we  can  predict  the  probability  of  an  overload,  we  can  choose  processes  with 
low value density as candidates for being removed from a deadline schedule, until a
deadline  schedule  is  produced  which  has  an  acceptably  low  probability  of  producing  an 
overload. 


The  algorithm  we  will  use,  then,  will  start  with  a  deadline-ordered  sequence  of  the  available 
processes,  which  will  be  sequentially  checked  for  its  probability  of  overload.  At  any  point  in  the 
sequence  in  which  the  overload  probability  passes  a  preset  point,  the  process  prior  to  the  overload 
with the lowest value density will be removed from the sequence, repeating until the overload
probability  is  acceptable. 
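
The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ArchOS implementation: computation times are assumed to be independent and normally distributed with a known mean and variance, the overload probability at a position is taken to be the probability that the processes up to and including that position finish after that position's deadline, and the process at the overload point is itself treated as a removal candidate. The names `Proc`, `overload_prob`, and `best_effort` are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    deadline: float  # absolute deadline
    mean: float      # mean of the estimated computation time (must be > 0)
    var: float       # variance of the estimated computation time
    value: float     # value for completing the process

def overload_prob(prefix, deadline, now):
    """P(total computation of `prefix` exceeds the time left until `deadline`)."""
    mu = sum(p.mean for p in prefix)
    sigma = math.sqrt(sum(p.var for p in prefix))
    slack = deadline - now - mu
    if sigma == 0.0:
        return 0.0 if slack >= 0 else 1.0
    # Upper tail of a normal distribution, expressed with the complementary error function.
    return 0.5 * math.erfc(slack / (sigma * math.sqrt(2.0)))

def best_effort(procs, now, threshold):
    """Return (kept, shed): a deadline-ordered sequence and the processes removed from it."""
    seq = sorted(procs, key=lambda p: p.deadline)
    shed = []
    k = 0
    while k < len(seq):
        if overload_prob(seq[:k + 1], seq[k].deadline, now) > threshold:
            # Remove the lowest value density process at or before the overload point.
            victim = min(seq[:k + 1], key=lambda p: p.value / p.mean)
            seq.remove(victim)
            shed.append(victim)
            k = 0  # re-check the shortened sequence from the start
        else:
            k += 1
    return seq, shed

procs = [
    Proc("A", deadline=10.0, mean=4.0, var=1.0, value=10.0),
    Proc("B", deadline=12.0, mean=5.0, var=1.0, value=5.0),
    Proc("C", deadline=15.0, mean=6.0, var=1.0, value=30.0),
]
kept, shed = best_effort(procs, now=0.0, threshold=0.05)
print([p.name for p in kept], [p.name for p in shed])
```

In this example the three processes together need a mean of 15 time units before C's deadline at time 15, so the overload probability at C is 0.5; B has the lowest value density and is removed, after which the remaining sequence is acceptable.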


3.9.2  Time  Management 

The  ArchOS  scheduler  will  use  information  provided  by  the  time  management  primitives  to  make  its 
scheduling  decisions.  This  information  is  derived  from  these  primitives: 

rtc = GetRealTime()
Timedate(time, date)

TimeDate  =  Delay(deltime,  criticaltime,  comptime,  setflag) 

TimeDate  =  Alarm(alarmtime,  criticaltime,  comptime,  setflag) 

val  =  SetTimeDate(time,  date) 


In  addition  to  these  primitives,  policy  directives  will  be  used  which  are  provided  by  the  Policy 
module  (see  Section  3.2.2.3). 


The  information  needed  for  making  the  scheduling  decisions  includes: 

1.  The  request  time.  This  is  the  time  a  process  becomes  available  for  execution,  although 
its value function may be negative, forcing the scheduler to delay it until it can be
completed  with  as  high  a  positive  value  as  possible.  Its  value  function  is  defined  by  the 
process  itself  using  the  delay  or  alarm  primitives. 

2. The critical time. This is the time, relative to the request time for the process, at
which the value function may have a discontinuity, and is determined when the process
was requested. Its value relative to the request time is set by the process itself using the
delay or alarm primitives.

3.  The  value  function  parameters  themselves.  The  value  function  is  divided  into  two 
parts:  (1 )  the  value  for  completing  the  process  prior  to  its  critical  time,  and  (2)  the  value 
for  completing  the  process  after  its  critical  time.  In  each  of  these  functions,  the  time  used 
in  computing  the  function  is  relative  to  the  critical  time,  so  the  value  function  used  prior  to 
the critical time measures its time "backwards". Each function has the form:

$V(t) = K_1 + K_2 t - K_3 t^2 + K_4 e^{-K_5 t}$

so there are ten parameters defining the total value function.
Value  function  parameters  are  defined  by  the  application  programmer,  but  are  modified 
by  the  system  policy  currently  in  effect.  In  the  case  of  processes  (i.e.,  server  processes) 
scheduled  on  behalf  of  other  processes  (i.e.,  client  processes),  the  value  function  can,  at 
the  option  of  the  process,  use  either  the  value  function  of  its  client  or  its  own  value 
function. 

4. System policy. The system policy is set by the policy module, and consists of a set of
specific policy choices (e.g., abort processes whose value functions have fallen below
zero) as well as two parameters for each process in the system. These two parameters, an
additive term and a multiplicative coefficient, are used to make a linear transformation of
the application-defined value function $V(t)$, yielding $V'(t)$. It is $V'(t)$ which is actually
used to perform the value function scheduling computation.
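
As an illustration of items 3 and 4, the sketch below evaluates a two-part value function and applies a linear policy transformation. It assumes that each part has the five-parameter form $K_1 + K_2 t - K_3 t^2 + K_4 e^{-K_5 t}$, with $t$ measured away from the critical time in both directions. The class and function names (`Piece`, `ValueFunction`, `apply_policy`) and the parameter names `offset` and `scale` are hypothetical and do not come from the ArchOS design.

```python
import math
from dataclasses import dataclass

@dataclass
class Piece:
    """One half of a value function: K1 + K2*t - K3*t^2 + K4*exp(-K5*t) (assumed form)."""
    k1: float = 0.0
    k2: float = 0.0
    k3: float = 0.0
    k4: float = 0.0
    k5: float = 0.0

    def __call__(self, t):
        return self.k1 + self.k2 * t - self.k3 * t * t + self.k4 * math.exp(-self.k5 * t)

@dataclass
class ValueFunction:
    before: Piece         # used when completion is at or before the critical time
    after: Piece          # used when completion is after the critical time
    critical_time: float  # relative to the request time

    def value(self, completion):
        """Value of completing at `completion` (time relative to the request time)."""
        dt = completion - self.critical_time
        if dt <= 0:
            return self.before(-dt)  # time is measured "backwards" from the critical time
        return self.after(dt)

def apply_policy(v, offset, scale):
    """Linear policy transformation of an application-defined value (hypothetical names)."""
    return offset + scale * v

# Constant value 10 up to the critical time, decaying exponentially afterwards.
vf = ValueFunction(before=Piece(k1=10.0), after=Piece(k4=10.0, k5=1.0), critical_time=5.0)
print(vf.value(3.0), vf.value(6.0), apply_policy(vf.value(3.0), offset=1.0, scale=2.0))
```

Two `Piece` instances carry five parameters each, which matches the count of ten parameters for the total value function.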


3.9.3  Short-term  vs.  Long-term  Scheduling 

Using the algorithm described above in subsection 3.9.1.3, these decisions will result in a long-term
high value as long as an overload condition is not maintained for a significant period of time. A
critical part of the scheduling algorithm will be the computation of the probability that an overload
condition,  other  than  a  transient  overload,  has  occurred  or  is  about  to  occur,  which  will  result  in 
making  a  decision  about  long-term  reconfiguration  across  the  system  node  boundaries.  In  addition  to 
decisions  about  reconfiguration,  decisions  with  respect  to  process  abortion,  preemption,  and  process 
scheduling  in  support  of  client  processes  outside  the  node  will  be  made  by  the  scheduling  algorithm. 


ArchOS will use a three-phased approach to handling the inter-node scheduling control decisions.
These include reconfiguration decisions, decisions about scheduling processes invoked across
node boundaries in support of common tasks, and such decisions as when a process
should be moved and the best destination to which it should be moved.

1. Purely local information will be utilized. This means that the attributes of the functions
resident on the node will be used, but no inter-node coordination of process scheduling
will  take  place.  Decisions  by  processes  in  one  arobject  to  request  services  in  another 
arobject  (either  local  or  remote)  will  be  made  unilaterally  by  the  requesting  node. 

2.  Primarily  local  information  will  be  utilized,  augmented  with  value  function  and  critical  time 
information  from  the  clients  of  the  processes  being  scheduled.  In  this  way,  scheduling  by 
one node on behalf of another node will be supportive of the needs of the client node as
far as possible, but no coordination of, for example, two server nodes in support of a third
client node will be performed. Thus, it is possible (but hopefully unlikely) that in a heavily
loaded environment two such server nodes could abort processes on behalf of clients in
such  a  way  that  no  client  gets  adequately  served,  even  though  work  is  being  performed 
on  its  behalf. 

3. Local information will be augmented by some form of negotiation among scheduling
routines  to  ensure  that  work  on  behalf  of  remote  client  processes  is  coordinated,  and,  if 
abortion  is  necessary,  a  best-effort  is  made  to  do  it  in  a  way  which  maximizes  the  total 
system  value. 

3.10 Time-Driven Virtual Memory Subsystem

The Time-Driven Virtual Memory (TDVM) Subsystem is responsible for managing the primary
memory  page  frames.  It  allocates  those  page  frames  among  the  most  critical  processes,  as 
dynamically  determined  by  the  Time-Driven  Scheduler  (TDS)  Subsystem.  The  basic  goal  of  the  TDS 
and  TDVM  Subsystems  is  to  ensure  that  the  required  CPU  time  and  primary  memory  space  resources 
are  both  available  when  needed,  in  order  to  complete  critical  tasks  within  their  deadlines. 
Furthermore,  when  it  becomes  necessary  to  miss  some  deadlines  due  to  overload  situations,  TDS  and 
TDVM  attempt  to  allocate  the  oversubscribed  resources  to  the  most  important  processes,  so  as  to 
maximize the total value of the tasks completed (see Section 3.9).

The basic goal of the TDVM Subsystem differs substantially from that of a general purpose
time-sharing system's virtual memory manager. Although there are many theoretical results and a great
deal  of  practical  experience  in  designing  virtual  memory  systems  for  general  purpose  computing 
environments,  it  is  unclear  how  much  of  this  work  can  be  applied  in  the  real-time  (or  time-driven) 
environment  of  ArchOS.  Indeed,  very  few  real-time  systems  to  date  have  supported  virtual  memory 
facilities,  since  it  is  very  difficult  to  provide  real-time  guarantees  in  the  face  of  unexpected  paging 
activity.  Thus,  the  design  of  the  ArchOS  TDVM  Subsystem,  as  outlined  in  this  Section,  can  only  be 
regarded as a preliminary version. It provides the required functionality, but many of the policies and
design decisions are based on little or no supporting data or theory. This area of the ArchOS design
will  be  one  of  the  important  focus  points  for  further  study,  experimentation,  and  development. 

3.10.1 Memory Management Policies and Techniques

In  order  to  manage  the  primary  memory  (space)  resources  in  a  time-driven  fashion,  consistent  with 
the  management  of  the  CPU  (time)  resources,  the  TDVM  Subsystem  uses  the  "working  set  model"  as 
the  basis  for  its  management  policies  and  techniques.  The  working  set  model  for  memory 
management  was  first  described  by  Denning  [Denning  68],  and  it  has  been  the  focus  of  a  great  deal  of 
subsequent research [Denning 80]. The working set of a process is defined as the set of its virtual
memory pages which have been accessed within the last W units of CPU (virtual) time. Note that
working  sets  are  defined  on  a  per-process  basis,  and  are  determined  with  respect  to  the  per-process 
virtual  times.  The  goal  of  a  working  set  memory  manager  is  to  ensure  that  the  complete  working  set 
for  each  active  process  is  resident  in  primary  memory.  If  there  are  too  many  active  processes,  some 
will  have  to  be  swapped  out  to  secondary  memory  (and  thus  become  inactive),  to  ensure  that  all  of  the 
remaining  working  sets  will  fit  within  primary  memory.  This  guarantees  that  the  active  processes  will 
execute efficiently, with a minimum number of page faults (i.e. "thrashing" is avoided).
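
The working set definition can be made concrete with a short Python sketch. It assumes a per-process reference trace of (virtual time, page) pairs is available, which a real memory manager would not record directly; the function name `working_set` and the trace layout are illustrative only.

```python
def working_set(accesses, now, window):
    """Pages accessed within the last `window` units of the process's virtual time.

    accesses: iterable of (virtual_time, page) pairs for a single process.
    now:      the process's current virtual time.
    """
    return {page for t, page in accesses if now - window < t <= now}

trace = [(1, "a"), (2, "b"), (5, "a"), (9, "c"), (10, "b")]
print(sorted(working_set(trace, now=10, window=3)))  # pages touched after virtual time 7
print(sorted(working_set(trace, now=10, window=6)))  # pages touched after virtual time 4
```

A larger window yields a working set that is at least as large, which is why the choice of window size trades paging against the number of working sets that fit in primary memory.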

Since  a  working  set  memory  manager  determines  which  pages  are  to  be  resident  in  primary  memory 
on a per-process basis, it is called a "local" memory management technique. This is in contrast to
"global" techniques, which are only concerned with the real time since a page was last accessed,
regardless  of  which  process  it  belongs  to.  For  the  purposes  of  time-driven  virtual  memory 
management,  a  local  technique  is  expected  to  be  much  more  effective  in  allocating  the  primary 
memory  resources  to  the  most  important  processes.  This  is  because  it  can  account  for  variations  in 
the  urgency  of  processes,  based  on  their  deadlines  and  the  penalties  for  missing  them,  and  ensure 
that  the  most  urgent  processes  have  their  working  sets  completely  resident. 

A  key  factor  affecting  the  operation  of  a  working  set  memory  manager  is  the  choice  of  W,  the 
working set parameter or working set window size. In traditional time-sharing environments, it has
been found that using a single value of W, the same for all processes, works almost as effectively as
selecting the window size on a per-process basis. Also, the overall paging behavior of the system is
not very sensitive to small changes in W, although W must be tuned somewhat to yield performance
that is close to optimum.⁵⁶ The situation in a time-driven environment is somewhat different, since the
goal  is  to  ensure  that  deadlines,  especially  all  of  the  more  important  deadlines,  are  met,  and  there  is 
much  less  concern  with  overall  system  throughput.  In  this  case  it  could  be  beneficial  to  vary  the 
working  set  window  sizes,  depending  on  the  criticality  of  the  individual  processes.  Furthermore, 
since  the  criticality  of  a  process  can  vary  with  time,  it  may  be  useful  to  dynamically  vary  its  window 
size  as  well. 

Since the value of W depends upon the criticality of the individual process (how near it is to its
deadline, and the penalty for missing it), only the TDS Subsystem is in a position to specify the varying
window sizes. Only a few, discrete window sizes, corresponding to multiples of some standard
window size, w, should actually be needed. An adequate set of sizes might be {w, 2w, 4w, ..., 128w}.
Note that w is a global system tuning parameter, corresponding to the single window size of a
standard, time-sharing, working set memory manager. TDS informs TDVM of the various working set
window sizes at the same time it provides TDVM with the ordered list of runnable processes, i.e. as a
result of calling the Schedule primitive (see Section 3.2.2.4).


56

If W is too small, excessive paging will occur, because it will not capture the "natural" working sets for many processes.
On the other hand, if W is too large, excessive swapping will occur. This is because the working sets will be larger, due to
pages taking longer to pass out of the working set window, and fewer working sets can then be fit in primary memory. Getting
W close to optimum involves balancing these two effects so as to maximize overall system throughput.

For  the  case  of  time-sharing  working  set  memory  managers,  it  has  been  found  that  maintaining  just 
the  working  set  sizes  (number  of  pages)  across  process  swaps,  without  recording  the  individual 
pages  of  the  working  sets,  still  yields  adequate  performance.  The  working  set  pages  are  simply 
faulted  back  in  on  demand,  after  the  process  has  been  swapped  back  in.  However,  in  a  time-driven 
system  such  paging  behavior  may  not  be  acceptable,  especially  if  the  process  is  nearing  its  deadline. 
As  a  result,  TDVM  will  maintain  the  actual  lists  of  working  set  pages  for  all  processes,  whether  active 
or  swapped.  This  will  allow  the  working  set  of  a  process  to  be  efficiently  reloaded  at  swap-in  time. 

The  choice  of  which  working  sets  should  be  resident  in  primary  memory,  and  which  should  be 
swapped-out,  will  be  made  by  TDVM  based  on  the  order  of  the  processes  in  the  scheduling  list,  which 
is  returned  by  the  TDS  Schedule  (or  RequestSwapList)  primitive.  In  general,  the  most  urgent 
processes  will  be  at  the  head  of  the  scheduling  list,  and  TDVM  will  attempt  to  load  and  maintain  the 
working  sets  of  as  many  of  those  processes  as  possible.  Since  it  is  expected  that  most  processes  will 
gradually  rise  through  the  scheduling  list  as  their  deadlines  approach,  this  technique  should  result  in 
the  preloading  of  most  working  sets,  prior  to  the  time  the  associated  processes  have  to  run.  If  a 
working  set  has  to  be  swapped-out  in  order  to  make  space  for  one  that  is  to  be  swapped-in,  either  the 
unswapped  process  closest  to  the  end  of  the  scheduling  list,  or  the  process  that  blocked  least 
recently, will be selected for swapping-out.⁵⁷ If the unswapped process that blocked least recently has
been blocked for more than some threshold amount of time, it will be the one selected for
swapping-out.
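
The swap-out selection rule can be sketched as follows. This is one reading of the rule, offered as an assumption rather than as the TDVM implementation: the process that has been blocked longest is preferred once its blocked time exceeds the threshold, and otherwise the unswapped process closest to the end of the scheduling list is chosen. The function name and argument layout are hypothetical.

```python
def choose_swap_victim(sched_list, resident, blocked_since, now, threshold):
    """Pick the process whose working set should be swapped out, or None.

    sched_list:    process IDs ordered from most to least urgent.
    resident:      set of process IDs whose working sets are unswapped.
    blocked_since: dict mapping blocked process ID -> time at which it blocked.
    """
    blocked = {pid: t for pid, t in blocked_since.items() if pid in resident}
    if blocked:
        # "Blocked least recently" means the earliest blocking time.
        oldest = min(blocked, key=blocked.get)
        if now - blocked[oldest] > threshold:
            return oldest
    # Otherwise take the unswapped process closest to the end of the scheduling list.
    for pid in reversed(sched_list):
        if pid in resident:
            return pid
    return None

sched = [1, 2, 3, 4]
resident = {1, 2, 3, 7}  # process 4 is already swapped out; 7 is blocked, so not in the list
print(choose_swap_victim(sched, resident, {7: 0.0}, now=10.0, threshold=5.0))
print(choose_swap_victim(sched, resident, {7: 0.0}, now=10.0, threshold=20.0))
```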

For  the  processes  which  are  currently  active,  techniques  are  required  for  determining  which  pages 
belong  to  their  working  sets.  At  the  time  that  a  new  process  is  created,  there  is  no  execution  history 
available  to  indicate  what  its  initial  working  set  ought  to  be.  In  order  to  reduce  the  number  of  initial 
page faults, and hence improve the efficiency of process "loading" and startup, TDVM will allow an
initial  working  set  to  be  specified.  This  initial  working  set  could  either  be  determined  heuristically  (for 
example,  x  percent  of  the  User  Text  and  y  percent  of  the  User  Data),  or  it  could  be  based  on 
information  obtained  through  previous  executions  of  the  process. 

57

The scheduling list will contain all processes which are "sleeping" (due to Delay or Alarm primitives), but any processes
blocked for other reasons (such as Accept primitives) will not appear in that list.




As  a  process  runs,  TDVM  must  dynamically  adjust  the  process'  working  set,  to  track  the  process 
through different phases of its execution. The primary way that pages get added to a working set is
through  page  faults,  i.e.  as  new  pages  are  accessed  they  get  paged-in  "on  demand".  If  TDVM  detects 
sequential  paging  activity  in  one  of  the  data  segments  of  a  process’  address  space,  it  can  attempt  to 
do  sequential  pre-paging.  This  technique  can  be  very  effective  in  improving  the  performance  of 
sequentially  accessed  File  Arobjects,  and  in  many  other  situations  as  well.  One  other  way  that  pages 
can get added to a working set is if the working set window size is increased.⁵⁸ When TDVM detects
an increase in the window size (by checking the value in the scheduling list, obtained through the
TDS Schedule primitive), it must scan through the "last accessed times" of all pages in the address
space.  This  will  allow  it  to  determine  which  pages  should  be  added  to  the  working  set  (paged-in),  in 
addition  to  those  already  there. 

The  removal  of  pages  from  a  working  set  occurs  when  pages  fall  outside  of  the  working  set  window. 
This  can  happen  either  as  a  result  of  the  process’  virtual  time  advancing  more  than  W  units  since  one 
or  more  of  the  working  set  pages  were  last  accessed,  or  as  a  result  of  the  working  set  window  size 
decreasing.  A  decrease  in  the  window  size  is  detected  in  the  same  manner  as  an  increase,  but  in  this 
case  only  the  current  working  set  pages  need  be  scanned  to  determine  which,  if  any,  are  to  be 
removed from the working set. Determining the last accessed times for pages of the working set can
only be done approximately, assuming no specialized virtual memory hardware other than USED flags
is available. It is accomplished by scanning the USED flags of the working set pages every time w
units of virtual time have passed.⁵⁹ For every page in which the USED flag is set, that flag is cleared
and the last accessed time is updated to the current virtual time. For any page in which the USED flag
was clear, the last accessed time is checked to see if it falls outside of the working set window. If so,
that page is removed from the working set.⁶⁰
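
The periodic scan described above can be sketched in Python as follows. The `Page` record and the dictionary representation of a working set are assumptions made for illustration; in a real system the USED flag would live in the hardware page table entry.

```python
from dataclasses import dataclass

@dataclass
class Page:
    used: bool = False        # stands in for the hardware USED flag
    last_access: float = 0.0  # approximate virtual time of the last access

def scan_working_set(working_set, vtime, window):
    """Scan one process's working set; return the page numbers removed from it.

    working_set: dict mapping page number -> Page, modified in place.
    vtime:       the process's current virtual time.
    window:      the process's current working set window size W.
    """
    removed = []
    for number, page in list(working_set.items()):
        if page.used:
            # Referenced since the last scan: clear the flag and refresh the timestamp.
            page.used = False
            page.last_access = vtime
        elif vtime - page.last_access > window:
            # Not referenced, and now outside the working set window.
            removed.append(number)
            del working_set[number]
    return removed

ws = {
    0: Page(used=True, last_access=0.0),
    1: Page(used=False, last_access=2.0),
    2: Page(used=False, last_access=9.0),
}
print(scan_working_set(ws, vtime=10.0, window=4.0), sorted(ws))
```

Because timestamps are only refreshed at scan time, a page's recorded last access can lag its true last access by up to one scan interval, which is why the text calls the result approximate.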

One  other  major  responsibility  of  the  TDVM  Subsystem  is  management  of  the  free  page  frame  pool. 
A  free  page  is  any  primary  memory  page  which  does  not  belong  to  one  of  the  currently  active  working 
sets.  Such  a  page  can  be  in  any  one  of  three  different  states: 

1.  On  the  Page-Out  List:  It  corresponds  to  an  existing  page  in  a  secondary  memory  page 
set,  but  it  has  been  modified  and  hence  must  be  written  back  to  the  page  set  before  it  can 
be  reused. 


58

This  happens  when  TDS  determines  that  it  has  now  become  more  important  to  avoid  page  faults,  because  the  process  is 
nearing  its  deadline  and  the  penalty  for  missing  the  deadline  is  high. 

59 

Recall  that  W,  the  working  set  window  size,  is  some  multiple  of  w,  the  base  window  size. 

60

Note that, in practice, working set scans could be initiated at the time of processor rescheduling decisions (calls to
Schedule), assuming that such decisions occur at intervals of at most w units of real time. In that case the current process'
working set need only be scanned if the process has accumulated at least w units of virtual time since it was last scanned.




2.  On  the  Reclaim  List:  It  corresponds  exactly  to  an  existing  page  in  a  secondary  memory 
page  set,  and  hence  it  is  free  to  be  reused  without  first  writing  it  back  to  the  page  set. 

3. On the Free List: It is completely free to be reused, since it does not correspond to any
existing  page  in  a  secondary  memory  page  set. 

When  a  page  fault  occurs,  TDVM  checks  the  Page-Out  and  Reclaim  Lists  to  see  if  the  required  page 
is already in primary memory, and hence can be quickly "reclaimed". If so, the time and expense of
reading the page from its secondary memory page set can be avoided. Otherwise a free page must
be selected to hold the contents of the page as it is read from the page set. Pages are selected first
from the Free List, but if it is empty, then they will be selected from the Reclaim List. The Reclaim List
is maintained in FIFO order, so that those pages which have been free for the longest time will be
selected  first  for  reuse.  If  the  Reclaim  List  is  also  empty,  then  a  page  from  the  Page-Out  List  will  have 
to  be  selected.  Of  course,  that  page  will  have  to  first  be  written  back  to  its  corresponding  page  set, 
before  it  can  be  reused.  The  Page-Out  List,  like  the  Reclaim  List,  is  maintained  in  FIFO  order,  so  the 
page  that  has  been  free  the  longest  will  be  the  first  to  be  reused.  In  the  event  that  no  free  page  can  be 
found,  i.e.  the  Page-Out  List  is  also  empty,  then  TDVM  must  initiate  the  swapping-out  of  a  process  in 
order  to  make  some  free  pages  available. 
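
The order in which the three lists are consulted on a page fault can be summarized in a small Python sketch. The `FramePool` class, its method name, and the returned status strings are hypothetical; the write-back and page-in I/O are omitted, and only the selection order follows the text.

```python
from collections import deque

class FramePool:
    """Free page frames, kept in three lists (illustrative sketch only)."""

    def __init__(self, free, reclaim, page_out):
        self.free = deque(free)          # frame numbers with no page set contents
        self.reclaim = deque(reclaim)    # FIFO of (frame, page): clean copies
        self.page_out = deque(page_out)  # FIFO of (frame, page): modified copies

    def handle_fault(self, page):
        """Return (frame, how) for a faulting page; frame is None if a swap-out is needed."""
        # 1. The page may still be in primary memory and can simply be reclaimed.
        for lst in (self.page_out, self.reclaim):
            match = next((entry for entry in lst if entry[1] == page), None)
            if match is not None:
                lst.remove(match)
                return match[0], "reclaimed"
        # 2. Otherwise select a frame: Free List, then Reclaim List, then Page-Out List.
        if self.free:
            return self.free.popleft(), "free"
        if self.reclaim:
            return self.reclaim.popleft()[0], "reused-clean"
        if self.page_out:
            frame, _old_page = self.page_out.popleft()
            # The old contents would be written back to their page set here.
            return frame, "reused-after-write-back"
        return None, "swap-out-needed"

pool = FramePool(free=[10], reclaim=[(11, "p1"), (12, "p2")], page_out=[(13, "p3")])
for page in ("p2", "x", "y", "z", "w"):
    print(page, pool.handle_fault(page))
```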

To  help  avoid  the  situation  of  having  to  page-out  or  swap-out  at  page  fault  time  (i.e.  to  help  reduce 
the elapsed time for handling a page fault), TDVM attempts to maintain a minimum number of pages
in  the  Free  List  and  Reclaim  List  combined.  When  the  number  of  available  pages  in  those  lists  drops 
below  a  threshold  (approximately  five  percent  of  the  primary  memory  page  frames),  TDVM  will  initiate 
page-out  operations  in  order  to  move  some  pages  from  the  Page-Out  List  to  the  Reclaim  List.  If  there 
are  still  not  enough  free  pages  available,  then  TDVM  will  initiate  a  swap-out  operation  in  order  to  free 
all  of  the  working  set  pages  from  some  process  (the  least  urgent  one).  Note  that  checking  the  free 
page  threshold  and  initiating  page-out  and  swap-out  operations  can  be  done  at  the  time  of  processor 
rescheduling  decisions,  just  as  for  the  case  of  initiating  working  set  scan  operations. 
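
The threshold check can be sketched as a routine run at rescheduling time. This is a simplified illustration with hypothetical names: the lists are plain Python lists, the threshold is passed as a frame count rather than derived from a percentage, and the write-back itself is represented only by moving an entry from one list to the other.

```python
def replenish(free, reclaim, page_out, threshold):
    """Move pages from the Page-Out List to the Reclaim List until enough are available.

    Returns (written, swap_needed): the number of pages written back, and whether a
    swap-out is still required because the two lists remain below the threshold.
    """
    written = 0
    while len(free) + len(reclaim) < threshold and page_out:
        # Writing a modified page back makes its frame reusable without further I/O.
        reclaim.append(page_out.pop(0))
        written += 1
    swap_needed = len(free) + len(reclaim) < threshold
    return written, swap_needed

free, reclaim, page_out = [1], [2], [3, 4]
print(replenish(free, reclaim, page_out, threshold=5))  # still short after paging out
free, reclaim, page_out = [1, 2, 3], [4], [5, 6, 7]
print(replenish(free, reclaim, page_out, threshold=5))  # one write-back is enough
```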

3.10.2  Time-Driven  Virtual  Memory  Subsystem  Primitives 

The  TDVM  Subsystem  provides  the  following  set  of  primitives,  primarily  for  use  by  the 
Arobject/Process  Management  (A/PM)  Subsystem: 




val = VMPrePage(asid, vpa, nvp)
val = VMLock(asid, vpa, nvp)
val = VMUnlock(asid, vpa, nvp)

pid = VMPageFault(asid, vpa)
pid = VMSchedule()

val = VMFlush(asid, vpa, nvp)
val = VMSwap()

gpsid = VMFreeze(asid)
val = VMUnfreeze(gpsid [, asid])

val = VMMove(asid, vpa, desired-vpa)
val = VMZero(asid, vpa, nvp)

val = VMDestroy(asid)
val = VMFree(asid, vpa)

val = VMStatus(statusbuffer [, asid])
val = VMRestart()


BOOLEAN val
        TRUE if the specified operation is completed successfully; otherwise FALSE.

PID pid
        The process ID of the process which is to be run at this time.

GPSID gpsid
        Global Page Set ID, which identifies the temporary page set containing the frozen
        (swapped) address space descriptor information. It includes the ID of the logical
        disk containing the page set, the page set type (TEMPORARY), and the unique ID
        of this page set within the logical disk.

ASID asid
        Address Space ID, which identifies the address space to be operated upon. The
        corresponding process ID can be easily obtained from an ASID, and vice versa.

VIRT-PAGE-ADDRESS vpa, desired-vpa
        A virtual page address within an address space.

INT nvp
        The number of virtual pages involved in the operation.

VMSTATUSBUFFER *statusbuffer
        The buffer address for returning status information.

On Error:
        Error conditions are indicated by the use of special return values. The details
        concerning the precise nature of an error condition are provided in the Kernel
        Error Block.




The  VMPrePage  primitive  initiates  the  paging-in  of  the  specified  set  of  virtual  pages  (nvp  pages, 
beginning  with  page  vpa),  in  address  space  asid.  This  is  useful  for  setting  up  an  initial  working  set  of 
pages  for  a  newly  created  process,  since  otherwise  there  is  no  execution  history  to  indicate  what  the 
working  set  ought  to  be.  The  VMLock  primitive  is  similar  to  VMPrePage  in  that  it  causes  the  specified 
set  of  virtual  pages  to  be  paged-in,  if  not  already  in  primary  memory.  However,  VMLock  is 
synchronous  in  the  sense  that  it  will  not  return  until  all  of  the  specified  pages  are  actually  present  in 
primary  memory.  Furthermore,  those  pages  will  be  “locked”  in  primary  memory,  i.e.  they  will  not  be 
considered  eligible  for  removal  from  the  address  space's  working  set,  regardless  of  how  long  it  has 
been  since  they  were  last  accessed.  VMUnlock  will  “unlock”  the  specified  set  of  virtual  pages, 
making  them  eligible  for  removal  from  the  working  set,  and  allowing  the  corresponding  physical  page 
frames  to  be  subsequently  reused. 

VMPageFault is the means by which the TDVM Subsystem is notified of page faults. It is assumed
that the specified virtual page (vpa) is indeed one of the allocated pages of address space asid, and
hence the page fault is truly due to accessing a valid page that isn’t resident in primary memory
(rather than a protection or invalid address fault). VMPageFault first checks to see if the missing page
is already present somewhere in primary memory. This would be the case if the page had been in
primary memory earlier, was “freed” due to inactivity, but the associated page frame had not yet been
reused. It would also be the case if the page is shared and is already resident as part of another
process’ working set. If the missing page is found in primary memory, the page fault is handled very
quickly, with no need to switch processes. In this case VMPageFault will return the ID (pid) of the
current process, which is the one that encountered the page fault. Otherwise VMPageFault must
initiate a page-in operation. Since reading a page from a secondary memory page set can take tens
of milliseconds (or even hundreds of milliseconds if paging-out is required before space is available
for the new page), a process switch is usually desirable when paging-in is required. In such cases,
VMPageFault will return the ID of the process which should now run in place of the faulting process.
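
The two outcomes of a page fault described above can be sketched as follows. This is a minimal illustration in Python, not the ArchOS implementation. The function name, its parameters, and the dictionaries standing in for TDVM state are all assumptions made for the example, and only the "freed but not yet reused" case of the fast path is modelled (shared pages are omitted).

```python
def vm_page_fault(asid, vpa, current_pid, resident_frames, working_sets,
                  urgent_page_ins, choose_next_process):
    """Return the pid that should run after a fault on page (asid, vpa).

    All parameters are hypothetical stand-ins for TDVM state:
    resident_frames: dict (asid, vpa) -> physical frame, for pages that were
        freed from a working set but whose frames have not been reused.
    working_sets: dict asid -> set of virtual pages in the working set.
    urgent_page_ins: list used as the queue of urgent page-in requests.
    choose_next_process: callable returning the pid to run while the
        page-in is outstanding (stands in for the scheduling step).
    """
    key = (asid, vpa)
    if key in resident_frames:
        # Fast path: the frame is still in primary memory, so the page is
        # put back into the working set and the faulting process continues.
        working_sets.setdefault(asid, set()).add(vpa)
        del resident_frames[key]
        return current_pid
    # Slow path: queue an urgent page-in and let another process run.
    urgent_page_ins.append(key)
    return choose_next_process()
```

In the fast case the caller gets back the faulting process's own ID; in the slow case it gets whatever the scheduling step selects.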

VMSchedule  is  intended  to  be  the  primary  means  by  which  the  Arobject/Process  Manager 
determines  which  process  is  to  be  running  at  the  present  time.  VMSchedule  calls  the  Schedule 
function  of  the  Time-Driven  Scheduler,  to  obtain  the  ordered  list  of  processes  which  are  to  be  run 
next.  From  this  list  the  TDVM  Subsystem  can  determine  which  address  space  working  sets  will  be 
required  in  primary  memory  in  the  near  future,  allowing  it  to  initiate  any  necessary  page-in  operations. 
Related  to  this,  VMSchedule  is  also  responsible  for  scanning  the  working  set  pages  of  the  current 
process,  to  see  if  any  of  them  have  not  been  accessed  recently  and  can  thus  be  removed  from  the 
working  set.  This  scanning  operation  is  only  initiated  if  the  process  has  accumulated  sufficient  virtual 
(CPU) time since its working set was last scanned. VMSchedule returns the ID of the process which is
to run now. It guarantees that the address space for that process is the active one (see the
description  of  ASActivate  in  Section  3.2.2.5). 

The VMFlush primitive ensures that the specified set of virtual pages from the given address space
are all “clean”, i.e. any primary memory images of these pages are paged-out, if they have been
flagged as MODIFIED. VMFlush, like VMLock, is synchronous. It will not return until all of the
modified pages have been written out to their corresponding page sets. VMFlush is primarily intended
to support the FlushPermanent primitive of the Arobject/Process Manager (see Section 3.3.3). It is
also used when “deactivating” arobjects, to ensure that all modifications have been flushed out to the
appropriate page sets before the address spaces are destroyed. Unless directed to flush modified
pages through VMFlush, the TDVM Subsystem will only initiate page-out activity when primary
memory begins to fill up, and some of the modified but not recently accessed page frames have to be
reused.

VMSwap  provides  a  mechanism  through  which  the  Arobject/Process  Manager  can  explicitly 
request  the  TDVM  Subsystem  to  “swap  out”  one  of  the  existing  but  currently  idle  address  spaces. 
Ordinarily, the TDVM Subsystem will only swap out an address space if the combined working sets of
all of the existing address spaces overflow primary memory. In that case TDVM calls the Time-Driven
Scheduler’s RequestSwapList function, to obtain the ordered list of swappable processes. TDVM
selects the most appropriate candidate from this list.⁶¹ All of the pages from the selected process’
working set are released, so that they can be paged-out and reused as required. TDVM then writes all
of the relevant address space descriptor information (including the list of working set pages) into a
temporary page set.⁶² The address space is then destroyed (using ASDestroy), which frees the
associated  address  space  resources  for  use  by  others.  This  freeing  of  address  space  resources  is  the 
primary  reason  for  making  VMSwap  available  to  the  Arobject/Process  Manager.  It  provides  a 
mechanism  for  recovering  from  “space  exhausted”  problems  when  creating  or  growing  address 
spaces. 
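
The candidate selection rule for swapping (stated in the accompanying footnote: the least urgent process that is not already swapped out and holds no locked pages) might be sketched as below. The `Candidate` record and its field names are hypothetical; the report does not specify the format of the list returned by RequestSwapList.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record for one entry of the swap list.
    pid: int
    continuation_time: float   # expected time at which the process resumes
    swapped: bool              # True if already swapped out
    has_locked_pages: bool     # True if any address space page is locked


def select_swap_victim(swap_list):
    """Return the pid of the least urgent eligible process, or None."""
    eligible = [c for c in swap_list
                if not c.swapped and not c.has_locked_pages]
    if not eligible:
        return None
    # "Least urgent" is taken to mean the expected continuation time that
    # lies farthest in the future.
    return max(eligible, key=lambda c: c.continuation_time).pid
```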

VMFreeze is quite similar to VMSwap, except that it allows a specific address space (asid) to be
selected for “swapping”. If that address space is already swapped out, the global page set ID (gpsid)

61

The least “urgent” process, for which the expected continuation time is farthest in the future, and which hasn't already
been swapped out, is selected. However, processes containing locked address space pages are not considered eligible for
swapping.

62

It is not necessary to record all of the address space information. Only the Private Region segment descriptors, and the
EXISTS/COPIED flag for each page in those segments need be saved. If this was the last process from the associated arobject
which was still resident on this node and not yet swapped out, then the Shared Region segment descriptors and the
corresponding EXISTS flags will also have to be saved.



of  the  temporary  page  set  containing  the  address  space  descriptor  is  returned,  after  first  ensuring 
that all MODIFIED pages from that address space have been flushed.⁶³ If the specified address space
is  not  currently  swapped  out,  it  is  flushed  and  swapped  (even  if  it  contains  locked  pages),  and  the 
gpsid  of  the  temporary  page  set  containing  the  address  space  descriptor  is  returned.  VMFreeze 
provides  support  for  the  Arobject/Process  Management  Subsystem's  FreezeArobject  and 
FreezeProcess  primitives  (see  Section  APMSEC).  It  is  also  used  when  migrating  processes  from  one 
node  to  another. 

Following a VMFreeze operation, the TDVM Subsystem removes all record of address space asid,
except for the page sets on secondary memory. The VMUnfreeze primitive can be used to restore a
previously frozen address space (indicated by gpsid, the ID of the page set containing the address
space descriptor), to the swapped state (see Figure 3-33). VMUnfreeze provides support for the
UnfreezeArobject  and  UnfreezeProcess  primitives  of  the  A/PM  Subsystem  (see  Section  APMSEC). 
Optionally,  an  address  space  can  be  given  a  new  ID  (asid)  at  the  time  it  is  unfrozen.  This  is  useful 
when  migrating  processes,  since  it  allows  the  address  space  to  be  associated  with  a  new  process  ID 
on  the  new  node. 


Figure 3-33: Life Cycle of an Address Space


VMMove  is  the  Arobject/Process  Manager’s  interface  to  the  ASMove  function,  which  changes  the 
location  of  an  existing  segment  in  address  space  asid  from  vpa  to  desired-vpa.  This  function  must  go 
through  the  TDVM  Subsystem,  so  that  TDVM  can  update  any  references  it  may  have  to  the  affected 
virtual  pages.  See  Section  3.2.2.5  for  more  details  concerning  the  ASMove  primitive.  VMZero 


63

When an address space is swapped, all of its MODIFIED pages are flagged for paging-out, but may not be flushed
immediately. A frozen address space, on the other hand, is guaranteed to have been completely flushed.




provides an efficient mechanism for “zeroing” complete pages of an address space. It does this by
clearing  the  EXISTS  flags  for  the  specified  virtual  pages,  so  that  the  next  time  any  of  these  pages  are 
accessed  they  will  first  be  zero-filled.  If  the  pages  to  be  zeroed  already  exist  on  the  corresponding 
page  set  (the  EXISTS  flags  were  set),  then  VMZero  will  also  call  the  appropriate  PSZero  or  APSZero 
primitive  to  zero  (or  free)  the  pages  from  the  page  set.  One  of  the  main  purposes  of  the  VMZero 
primitive  is  to  provide  efficient  support  for  the  ZeroFile  primitive,  which  is  provided  by  the  File  System 
Client  Interface  (see  Section  3.6). 

VMDestroy  and  VMFree,  like  VMMove,  are  the  A/PM  Subsystem's  interface  to  the  corresponding 
Address Space Management Primitives (ASDestroy and ASFree). VMDestroy completely destroys the
specified address space (asid), while VMFree deletes a segment (indicated by vpa) within address
space asid. These functions must go through the TDVM Subsystem, so that TDVM can remove any
references  it  may  have  to  the  affected  address  space  or  segment,  and  free  any  corresponding  primary 
memory  page  frames.  See  Section  3.2.2.5  for  more  details  concerning  the  ASDestroy  and  ASFree 
primitives. 

The  VMStatus  primitive  is  used  to  obtain  general  information  concerning  the  current  status  of  the 
TDVM  Subsystem.  This  includes  the  number  of  active  and  swapped  address  spaces,  the  average  size 
of  the  address  space  working  sets,  the  number  of  free  pages  available,  and  the  total  number  of 
page-in,  page-out,  and  page-reclaim  (avoided  or  fast  page-in)  operations.  This  information  can  be 
used  for  monitoring  the  system’s  virtual  memory  activity,  so  that  the  TDVM  Subsystem  can  be  tuned  to 
provide better performance. If an optional address space ID (asid) is specified, then VMStatus will
return  virtual  memory  information  related  to  that  particular  address  space.  This  includes  its  status 
(SWAPPED  or  ACTIVE),  working  set  size,  and  page-in,  page-out,  and  page-reclaim  statistics. 

The  VMRestart  primitive  is  the  initialization  operation  for  the  TDVM  Subsystem  (on  a  particular 
node).  It  is  assumed  to  be  called  automatically  as  the  first  TDVM  operation,  upon  restarting 
(rebooting) a failed node. Its sole purpose is to create the TDVM worker (kernel) processes, and
initialize  the  virtual  memory  management  data  structures.  These  TDVM  internal  processes  and  data 
structures  are  described  in  the  following  section. 

3.10.3 Components of the Time-Driven Virtual Memory Subsystem

The  TDVM  Subsystem  is  implemented  as  a  kernel  arobject,  with  a  separate  instance  on  each  node 
of the distributed computer system. Each instance is solely responsible for the management of the
primary  memory  page  frames  on  its  own  node,  and  has  no  need  to  communicate  or  cooperate  with 
any  of  its  peers  on  other  nodes.  Each  instance  of  the  TDVM  Subsystem  consists  of  three  processes 




and  five  main  data  structures,  as  illustrated  in  Figure  3-34.  The  TDVM  Manager  process  provides 
almost  all  of  the  subsystem’s  functionality,  and  is  the  only  process  which  can  access  and  modify  the 
main  data  structures.  The  sole  purpose  of  the  Page-In  Worker  and  Page-Out  Worker  processes  is  to 
allow  paging  (page  set  read  and  write)  operations  to  be  performed  asynchronously  with  other  TDVM 
Subsystem  processing.  The  two  worker  processes  are  basically  surrogates  for  the  TDVM  Manager. 
They  wait  for  page  set  operations  to  be  completed  on  behalf  of  the  TDVM  Manager,  allowing  the 
Manager  to  continue  handling  other  virtual  memory  related  operations. 


[Figure 3-34 is not reproduced. Legible labels: Working Sets List, Page-In List, Page-Out List, Reclaim List, Free List.]

Figure 3-34: Components of the Time-Driven Virtual Memory Subsystem


The first main data structure indicated in Figure 3-34 is the Address Space Working Sets List. This
data structure is illustrated in more detail in Figure 3-35. The Address Space List contains an entry
for every address space that is known to the TDVM Subsystem (whether it is currently active or
swapped out). Each entry indicates the current state of the address space (ACTIVE or SWAPPED). If
SWAPPED, the global page set ID (gpsid) of the temporary page set containing the address space’s
descriptor is provided. See the second entry in the Address Space List in Figure 3-35 for an example.
Otherwise  the  entry  contains  information  concerning  the  address  space’s  current  working  set  of 
virtual  pages. 


Figure 3-35: Address Space Working Sets List


For ACTIVE address spaces, the “last scan time” (time) and working set window size (W) are
provided, in addition to the actual list of virtual pages which comprise the working set. The last scan
time is the virtual (CPU) time at which the address space was last scanned to determine which pages
belong to the working set (were accessed recently). The W parameter defines “recently”, by
specifying  the  virtual  time  frame  during  which  a  page  must  have  been  accessed,  in  order  to  be 
considered  a  member  of  the  current  working  set.  The  list  of  working  set  pages  is  arranged  in  clusters, 
where  each  cluster  is  indicated  by  a  starting  address  (vpa)  and  its  size  (nvp).  Due  to  the  "locality  of 
references",  clustering  of  working  set  pages  is  expected  to  be  quite  common.  Thus,  this 
representation  of  working  sets  will  save  space,  and  it  will  also  improve  the  efficiency  of  page-in, 
page-out,  swap-in,  and  swap-out  operations,  since  entire  clusters  can  be  read  and  written  at  a  time. 
Also associated with each cluster is a lock flag, which indicates whether or not those pages are
locked  in  primary  memory  (as  a  result  of  a  previous  VMLock  primitive). 
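
As an illustration of the clustered working set representation and the role of the window W, the following Python sketch rebuilds a cluster list from per-page last-access times. It is a sketch under stated assumptions rather than the report's algorithm: the `Cluster` record mirrors the (vpa, nvp, lock) fields named above, but the per-page access-time dictionary and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    vpa: int            # first virtual page of the cluster
    nvp: int            # number of consecutive virtual pages
    lock: bool = False  # True if the pages are locked in primary memory


def to_clusters(pages, locked=frozenset()):
    """Group page numbers into runs of consecutive pages with equal lock state."""
    clusters = []
    for page in sorted(pages):
        is_locked = page in locked
        last = clusters[-1] if clusters else None
        if last and last.vpa + last.nvp == page and last.lock == is_locked:
            last.nvp += 1
        else:
            clusters.append(Cluster(page, 1, is_locked))
    return clusters


def scan_working_set(last_access, now, window, locked=frozenset()):
    """Keep pages accessed within `window` units of virtual time before `now`.

    Locked pages are kept regardless of when they were last accessed.
    """
    keep = {p for p, t in last_access.items() if now - t <= window}
    keep |= set(locked)
    return to_clusters(keep, locked)
```

For example, with accesses `{4: 95, 5: 96, 6: 10, 9: 99}`, `now=100`, and `window=20`, page 6 falls outside the window and the result is two clusters, one covering pages 4 and 5 and one covering page 9.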

The  second  main  data  structure  of  the  TDVM  Subsystem  is  the  Page-In  List.  This  list  is  actually 
maintained  as  two  separate  lists,  one  for  urgent  page-in  operations  (those  resulting  from  page  faults), 
and  another  for  the  less  urgent  "pre-paging"  operations  (including  the  swapping  in  and  expanding  of 
the  working  sets  of  address  spaces  which  have  increased  in  "urgency").  The  use  of  two  lists  allows 
the  urgent  page-in  operations  to  be  given  precedence  over  pre-page  operations.  The  two  Page-In 
Lists  have  identical  structures,  as  illustrated  in  Figure  3-36. 


Figure  3-36:  Page-In  Lists 


Each entry in a Page-In List indicates the address space (asid) and the virtual page (vpa) of a virtual
page that is to be paged-in. The corresponding page set and page number can then be determined
using the ASGetGPSID or ASGetSTE primitives (see Section 3.2.2.5). Each entry can also indicate a
range or cluster of nvp virtual pages, beginning with virtual page vpa. Normally, nvp = 1 for entries in
the Urgent List, since this will allow the corresponding process to continue executing as quickly as
possible.⁶⁶ However, the clustering of page-in operations in the Pre-Page List will help improve the
overall  paging  throughput.  Each  entry  in  a  Page-In  List  also  contains  a  lock  flag,  which  indicates 
whether  that  cluster  of  pages  is  to  be  locked  into  the  corresponding  address  space’s  working  set, 
after  it  has  been  paged-in. 
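
The precedence of urgent page-ins over pre-paging can be illustrated with two queues. This is only a sketch: the class and method names are hypothetical, and first-in-first-out ordering within each list is an assumption, since the report does not state how entries within a list are ordered.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PageInEntry:
    asid: int           # address space containing the pages
    vpa: int            # first virtual page to bring in
    nvp: int = 1        # number of pages in the cluster
    lock: bool = False  # lock the pages into the working set afterwards


class PageInLists:
    def __init__(self):
        self.urgent = deque()    # requests caused by page faults
        self.pre_page = deque()  # less urgent pre-paging requests

    def add_fault(self, asid, vpa):
        # Urgent entries normally cover a single page (nvp = 1).
        self.urgent.append(PageInEntry(asid, vpa))

    def add_pre_page(self, asid, vpa, nvp, lock=False):
        self.pre_page.append(PageInEntry(asid, vpa, nvp, lock))

    def next_request(self):
        """Return the next entry to service, urgent entries first."""
        if self.urgent:
            return self.urgent.popleft()
        if self.pre_page:
            return self.pre_page.popleft()
        return None
```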

Two  of  the  other  main  data  structures  of  the  TDVM  Subsystem  are  the  Page-Out  List  and  the 
Reclaim List. These two structures are closely related in that they both contain information about
primary  memory  page  frames  which  no  longer  belong  to  any  address  space’s  working  set.  The  only 
difference  between  the  two  lists  is  that  the  pages  in  the  Page-Out  List  have  been  modified,  and  hence 
they  must  be  written  back  to  their  corresponding  page  sets  before  they  can  be  reused.  The  Page-Out 
List  and  the  Reclaim  List  each  have  the  same  logical  structure,  as  illustrated  in  Figure  3-37. 

Each entry in the Page-Out List or Reclaim List represents a cluster of npages physical page frames,
beginning with ppa. This cluster corresponds to the npages of page set gpsid, beginning with page
pnum.⁶⁷ The Page-Out List and Reclaim List together allow the TDVM Subsystem to quickly “reclaim”
pages  which  have  not  yet  been  reused,  thus  avoiding  unnecessary  page-in  operations.  Although 


66

Some experimentation is needed to determine in which circumstances, if any, it might be desirable to perform clustered
urgent page-in operations.

67

Once again the clustering of pages, especially in the Page-Out List, should help improve paging throughput.




[Figure 3-37 is not reproduced. It shows the Page-Out List and the Reclaim List; each entry carries the fields gpsid, pnum, ppa, and npages.]

Figure 3-37: Page-Out and Reclaim Lists


conceptually they are simple lists (as illustrated in Figure 3-37), in practice the Page-Out List and
Reclaim List would be sorted into binary search trees, using (gpsid, pnum) as the key. This would
make the searching for reclaimable pages, as well as the insertion of new pages into existing clusters,
very fast. In addition, each entry in the Page-Out List and Reclaim List would be threaded in FIFO
order on its respective “reuse queue”. This ordering would allow the least recently used pages to be
selected for reuse, whenever the pool of “free” page frames was exhausted.
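
The combination of keyed lookup and FIFO reuse order can be sketched with a single ordered mapping. Note the substitutions: Python's `OrderedDict` is a hash table that remembers insertion order, not a binary search tree, and this sketch tracks single pages rather than clusters. The class and method names are hypothetical.

```python
from collections import OrderedDict


class ReclaimList:
    def __init__(self):
        # (gpsid, pnum) -> ppa, kept in insertion (FIFO) order.
        self._frames = OrderedDict()

    def release(self, gpsid, pnum, ppa):
        """Record a frame that has left every working set."""
        self._frames[(gpsid, pnum)] = ppa

    def reclaim(self, gpsid, pnum):
        """Return the frame still holding the page, or None if it is gone."""
        return self._frames.pop((gpsid, pnum), None)

    def take_oldest(self):
        """Give up the frame that was released longest ago, or None."""
        if not self._frames:
            return None
        _, ppa = self._frames.popitem(last=False)
        return ppa
```

A successful `reclaim` corresponds to an avoided page-in; `take_oldest` corresponds to reusing a frame when the free pool is exhausted.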

The  final  main  data  structure  of  the  TDVM  Subsystem  is  the  Free  Page  Frame  List,  which  is 
illustrated  in  Figure  3-38.  Each  entry  in  the  Free  List  has  a  very  simple  structure,  and  represents  a 
cluster  of  npages  physical  page  frames,  beginning  with  ppa.  Each  cluster  is  completely  free  to  reuse, 
since  it  does  not  correspond  to  any  existing  virtual  address  space  pages.  In  practice,  the  Free  List 
would  be  sorted  into  a  binary  search  tree,  using  ppa  as  the  key.  This  would  make  the  insertion  of 
newly  freed  page  frames  into  existing  clusters  very  fast.  In  addition,  each  entry  would  be  threaded  on 
an appropriate list according to its cluster size (npages). There are 32 separate cluster size lists, one
for each power of two. This arrangement helps reduce primary memory “fragmentation”, allowing
page frame clusters of the appropriate size to be found very quickly.
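
One way to realize the 32 power-of-two size lists is sketched below. The mapping of a cluster of npages frames to list k, where 2^k <= npages < 2^(k+1), is an assumption, as are the class and method names. Merging adjacent free clusters (the purpose of the ppa-keyed search tree) is omitted.

```python
NUM_CLASSES = 32  # one list per power of two, as stated in the text


def size_class(npages):
    """Index k such that 2**k <= npages < 2**(k + 1), capped at the last list."""
    if npages < 1:
        raise ValueError("a cluster must contain at least one page frame")
    return min(npages.bit_length() - 1, NUM_CLASSES - 1)


class FreeList:
    def __init__(self):
        self.classes = [[] for _ in range(NUM_CLASSES)]

    def add(self, ppa, npages):
        self.classes[size_class(npages)].append((ppa, npages))

    def allocate(self, npages):
        """Return (ppa, npages) for a cluster of exactly npages frames, or None.

        Only lists whose every cluster is guaranteed to be large enough are
        searched, so smaller lists never have to be scanned.
        """
        if npages < 1:
            raise ValueError("npages must be at least 1")
        first = (npages - 1).bit_length()  # smallest k with 2**k >= npages
        for k in range(first, NUM_CLASSES):
            if self.classes[k]:
                ppa, size = self.classes[k].pop()
                if size > npages:
                    # Return the unused tail of the cluster to the free list.
                    self.add(ppa + npages, size - npages)
                return ppa, npages
        return None
```

The trade-off in this sketch is that a cluster in the list just below the starting one might also fit but is never examined; that keeps allocation to a bounded number of list checks.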


Figure 3-38: Free Page Frame List




3.10.4 Interaction With Arobject/Process Manager and Time-Driven Scheduler

The TDVM Subsystem interacts extensively with both the Arobject/Process Manager and the
Time-Driven Scheduler, on the local node. Figure 3-39 illustrates the primary “uses” relationships among
these  subsystems.  A/PM  interacts  with  TDVM  in  two  main  ways,  either  directly  through  the 
invocation  of  TDVM  primitives,  or  indirectly  through  the  manipulation  of  address  spaces.  Since  most 
address  space  manipulations  are  actually  performed  by  TDVM,  and  TDVM  must  maintain  consistency 
between  its  own  data  structures  and  those  of  the  Address  Space  Management  Routines,  A/PM  is 
quite  restricted  in  terms  of  the  address  space  operations  it  is  allowed  to  perform.  In  particular,  certain 
operations (ASDestroy, ASFree, ASMove) must be invoked by A/PM via the corresponding TDVM
primitives,  as  noted  earlier.  However,  address  space  creation  (ASCreate)  and  growth  (ASAllocate, 
ASExpand)  can  both  be  handled  directly  by  A/PM,  since  TDVM  will  learn  about  them  automatically  as 
the  new  address  space  and  pages  are  used. 


Figure  3-39:  TDVM,  A/PM,  and  TDS  Subsystem  Interactions 


Besides  address  space  manipulations,  the  other  major  area  of  A/PM  and  TDVM  interaction  is 
process  scheduling.  The  determination  of  which  process  is  the  best  choice  to  be  running  on  a  node 
at any given time is ultimately the responsibility of the Time-Driven Scheduler. However, TDS is not
aware  of  the  status  of  the  process  address  spaces  (ACTIVE  or  SWAPPED),  and  thus  it  could  select  a 




process which TDVM has not yet had time to swap back in.⁶⁸ In order to coordinate properly between
TDVM  and  TDS  on  the  choice  of  which  process  to  run  at  the  present  time,  A/PM  must  always  use  the 
VMSchedule  primitive,  rather  than  directly  invoke  the  TDS  Schedule  primitive.  Figure  3-40  illustrates 
the  normal  sequence  of  events  which  occur  when  A/PM  must  reschedule  the  processor,  due  to  the 
current  process  blocking  or  completing. 


Figure 3-40: Normal Processor Rescheduling Sequence


A/PM first invokes the TDS Deschedule primitive, and after it completes (returns val) A/PM enters
the “idle state”, waiting for a “Reschedule” signal to indicate that another process is ready to run.
When TDS has determined which process to run next, it sends the Reschedule signal (interrupt) to
A/PM.⁶⁹ Upon receipt of the Reschedule signal, A/PM calls VMSchedule to determine exactly which
process is to be run next, and to switch to its associated address space. VMSchedule itself calls the
TDS Schedule primitive, to see which process the scheduler has selected. Assuming the selected
process is not swapped out, TDVM activates the process’ address space (by calling ASActivate), and
returns the ID of the process (pid) to A/PM. A/PM then completes the switch to process pid, and sets
it  running. 

If  the  process  selected  by  the  TDS  Schedule  primitive  (the  first  process  returned  in  pid-list)  is 
currently  swapped  out,  VMSchedule  will  initiate  the  swapping-in  of  that  process,  and  select  a  different 


68

This should seldom happen, but it can occur if all of the most urgent processes block in rapid succession, waiting for
various events. The best choice among the remaining “ready” processes could then be one that is still swapped out, since
TDVM had not anticipated the need to run it so soon. Another way this can happen is if a process which was blocked for a long
time and got swapped out suddenly becomes ready to run again, and it is immediately selected as the most urgent process.

69

If another process is already available and ready to run, there should be essentially zero delay before TDS sends the
Reschedule signal.




pid from pid-list. To be selected, a process must be the first one in pid-list which is not currently
swapped out, awaiting completion of a page-in operation, or put on “hold” by TDS.⁷⁰ If no appropriate
process can be found, VMSchedule returns BAD-PID to A/PM. This informs A/PM that it should
return to the idle state, rather than switch processes at this time.
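
The selection rule that VMSchedule applies to pid-list can be sketched as follows. The function name, the use of sets for the swapped, waiting, and held processes, and the numeric value chosen for BAD-PID are all assumptions made for illustration.

```python
BAD_PID = -1  # hypothetical stand-in for the report's BAD-PID value


def select_runnable(pid_list, hold, swapped, awaiting_page_in):
    """Pick the process to run from a list ordered most urgent first.

    Returns (pid, swap_in_requests). pid is BAD_PID when no process is
    eligible, which tells the caller to return to the idle state.
    """
    swap_in_requests = []
    if pid_list and pid_list[0] in swapped:
        # The scheduler's first choice is swapped out: start bringing it
        # back in, then look for a different process to run meanwhile.
        swap_in_requests.append(pid_list[0])
    for pid in pid_list:
        if pid in swapped or pid in awaiting_page_in or pid in hold:
            continue
        return pid, swap_in_requests
    return BAD_PID, swap_in_requests
```

For instance, with `pid_list=[3, 5, 7]`, process 3 swapped out and process 5 on hold, the sketch requests a swap-in of 3 and selects 7.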

In  addition  to  rescheduling  the  processor  when  the  current  process  blocks  or  completes, 
preemptive  rescheduling  can  also  occur.  Preemptive  rescheduling  due  to  an  urgent  process 
unblocking  works  almost  identically  to  the  normal  rescheduling  sequence  outlined  above,  and 
illustrated in Figure 3-40. The only difference is that A/PM invokes the TDS SetScheduleInfo primitive
(rather  than  Deschedule),  upon  the  occurrence  of  an  event  for  which  a  process  is  waiting  (blocked). 
This  informs  TDS  that  the  previously  blocked  process  is  now  schedulable,  so  that  TDS  can  determine 
when  it  should  be  run.  In  the  meantime,  A/PM  can  allow  the  process  which  was  running  at  the  time  of 
the  event  to  continue  executing.  If  TDS  determines  that  the  newly  unblocked  process  should  be  run 
immediately,  it  will  send  a  Reschedule  signal  to  A/PM.  A/PM  will  then  save  the  state  of  the  current 
process,  and  initiate  the  standard  VMSchedule  sequence  discussed  above.  Similarly,  if  at  any  point 
TDS  wishes  to  preemptively  reschedule  the  processor  (because  an  urgent  process  has  just 
completed  its  delay,  or  there  has  been  some  other  significant  change  in  the  value  functions  of 
processes),  it  can  do  so  by  simply  sending  a  Reschedule  signal  to  A/PM. 

Two  other  reasons  for  rescheduling  the  processor,  of  particular  relevance  to  TDVM,  are  the 
occurrence  of  page  faults  and  the  completion  of  page-in  and  swap-in  operations.  The  sequence  of 
events which occurs upon a page fault is illustrated in Figure 3-41. A process execution fault first
comes to the attention of A/PM. If it is a page fault, A/PM saves the state of the current process, and
invokes the VMPageFault primitive of TDVM. If a fast page reclaim is possible, TDVM handles it
immediately  and  returns  the  pid  of  the  current  process,  so  that  A/PM  can  continue  its  execution. 
Otherwise,  TDVM  initiates  a  page-in  operation,  and  calls  Schedule  to  determine  which  process  should 
be  run  while  waiting  for  the  page-in  to  complete.  Note  that  since  the  faulting  process  has  not  been 
Descheduled,  it  will  still  appear  in  the  pid-list  returned  by  Schedule,  but  it  will  be  ineligible  for 
selection  because  it  is  awaiting  completion  of  the  page-in  operation.  If  no  appropriate  process  is 
available,  BAD-PID  is  returned  to  A/PM,  indicating  that  the  processor  should  remain  idle  at  this  time. 
Otherwise,  the  address  space  of  the  selected  process  is  activated,  using  ASActivate,  and  its  pid  is 
returned  to  A/PM  in  order  to  set  it  running. 


70

The hold flags are returned along with the pids in pid-list. They indicate which processes are not to be run at present,
either because they are still “sleeping”, or because there is a penalty for running them too soon.


[Figure 3-41 is not reproduced. Legible labels, in numbered order: (1) Fault, (2) VMPageFault, (3) Schedule, (4) pid-list, (5) ASActivate, (6) val, (7) pid. A/PM and TDVM appear as participants.]

Figure 3-41: Page Fault Sequence


The  sequence  of  events  which  occurs  upon  the  completion  of  a  page-in  or  swap-in  operation  is 
illustrated  in  Figure  3-42.  TDVM  is  notified  of  such  an  event  through  a  new  "Request  for  Work"  from 
its  Page-In  Worker  process  (see  Figure  3-34).  In  response  to  this,  TDVM  initiates  the  standard 
(preemptive) rescheduling sequence (as discussed above), by sending a Reschedule signal to A/PM.
Note  that  the  process  for  which  the  page-in  or  swap-in  operation  has  just  been  completed  should  still 
appear  in  the  pid-list  returned  by  Schedule,  but  now  it  will  be  eligible  for  selection  as  the  process  to 
be  run  next. 


[Figure 3-42 is not reproduced. Legible label: (1) PageSet Read Completed.]

Figure 3-42: Page-In or Swap-In Completion Sequence




3.11 System Monitoring and Debugging Subsystem

The system monitoring and debugging subsystem provides facilities to monitor and control the
behavior of cooperating arobjects and processes at execution time. A monitoring and
debugging manager exists on each node and has the special privilege to freeze or unfreeze the activity
of an arobject or process.

3.11.1 Monitoring and Debugging Management

The actual monitoring and debugging operations are performed in the
arobject/process or communication subsystems. For instance, Freeze, Unfreeze, Fetch, and Store
operations on an arobject or process are performed by the arobject/process manager. However, Freeze
and Unfreeze operations for a specific node or for all applications are initiated at the monitoring and
debugging manager. Similarly, monitoring of communication activity is supported by the
communication subsystem.

The  following  ArchOS  primitives  are  supported  for  the  monitoring  and  debugging  management  for  a 
client. 


val = FreezeAllApplications()
val = UnfreezeAllApplications()
val = FreezeNode(node-id)
val = UnfreezeNode(node-id)

BOOLEAN  val  TRUE  if  the  primitive  was  executed  properly;  otherwise  FALSE. 

NODE-ID  node-id  The  node  id  indicates  the  actual  node  which  will  be  stopped. 


A FreezeAllApplications primitive stops all activities of the caller’s application, and a FreezeNode
primitive halts all of the client's activities on a specific node. To resume the client's application,
UnfreezeAllApplications or UnfreezeNode is used.

3.11.2 Monitoring/Debugging Protocol

When a system-wide Monitoring/Debugging request such as FreezeAllApplications is issued, a
Monitoring/Debugging  Manager's  worker  becomes  a  coordinator  among  the  other 
Monitoring/Debugging  Managers  and  propagates  the  request  to  the  others. 




References 

[Bernstein 83]  Bernstein, P. A.; Goodman, N.; Hadzilacos, V.

Recovery Algorithms for Database Systems.

Technical Report TR-10-83, Harvard University, 1983.

[Comer 79]  Comer, D.

The Ubiquitous B-tree.

ACM Computing Surveys 11(2):121-137, June, 1979.

[Conway 67]  Conway, Maxwell, and Miller.

Theory of Scheduling.

Addison-Wesley, 1967.

[Denning 68]  Denning, P. J.

The Working Set Model for Program Behavior.

Communications of the ACM 11(5):323-333, May, 1968.

[Denning 80]  Denning, P. J.

Working Sets Past and Present.

IEEE Transactions on Software Engineering SE-6(1):64-84, January, 1980.

[Eswaran 76]  Eswaran, K. P., J. N. Gray, R. A. Lorie and I. L. Traiger.

The Notions of Consistency and Predicate Locks in a Database System.

Comm. ACM 19(11), November, 1976.

[Garey 79]  Garey, M. R.; Johnson, D. S.

Computers and Intractability: A Guide to the Theory of NP-Completeness.

W. H. Freeman, San Francisco, 1979.

[Gray 81]  Gray, J. N.

The Transaction Concept: Virtues and Limitations.

Technical Report, 1981.

[Jensen 83a]  E. Douglas Jensen.

Decentralized System Control.

May, 1983

Final Report for RADC, Contract Number F30602-81-C-0297, Department of
Computer Science, Carnegie-Mellon University.

[Jensen 83b]  E. Douglas Jensen, et al.

Reconfigurable  C^  DDP  System. 

November,  1983 

Technical  Proposal  to  RADC,  PR  No.  B-4-3137,  Department  of  Computer  Science, 
Carnegie-Mellon  University. 

[Jensen  84]  Jensen,  E.  D.  and  Pleszkoch,  N. 

ArchOS: A Physically Dispersed Operating System - An Overview of its Objectives
and Approach.

IEEE Distributed Processing Technical Committee Newsletter, Special Issue on
Distributed Operating Systems, June, 1984.




[Jensen 85]  E. Douglas Jensen, H. Tokuda, et al.

Functional Description of ArchOS (Archons Operating System).

January, 1985

Technical Report for RADC, Contract No. F30602-84-C-0063, Department of
Computer Science, Carnegie-Mellon University.

[Liu 73]  Liu, C. L.; Layland, J. W.

Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment.

Journal of the Association for Computing Machinery 20(1):46-61, January, 1973.

[Locke 85]  C. Douglass Locke, Tokuda, H.; E. Douglas Jensen.

A Time-driven Scheduling Model for Real-Time Operating Systems.

To appear in Proceedings of Real-time Systems Symposium, 1985.

[Mitchell 82]  Mitchell, J., and Dion, J.

A Comparison of Two Network-Based File Servers.

Communications of the ACM 25(4):233-245, April, 1982.

[Moss 81]  Moss, J. E. B.

Nested Transactions: An Approach to Reliable Distributed Computing.

PhD thesis, Massachusetts Institute of Technology, April, 1981.

[Sha 85]  Sha, L.

Modular Concurrency Control and Failure Recovery - Consistency, Correctness
and Optimality.

PhD thesis, Carnegie-Mellon University, 1985.

[Sturgis 80]  Sturgis, H., Mitchell, J., and Israel, J.

Issues in the Design and Use of a Distributed File System.

Operating Systems Review 14(3):55-69, July, 1980.

[Tokuda 85]  Tokuda, H.

A Compensatable Atomic Objects in Object-oriented Operating Systems.

To appear in Proceedings of Pacific Computer Communication Symposium, 1985.

[Wulf 81]  Wulf, W. A.; Levin, R.; Harbison, S. P.

HYDRA/C.mmp: An Experimental Computer System.

McGraw-Hill, 1981.




4. Resource Management Theory




4.1  Best  Effort  Decision  Making  Overview 




Best  Effort  Decision  Making  Overview 

1  Best  Effort  Decision  Problem  Statement 

The  purpose  of  any  operating  system  is  to  present  a  virtual  computer  interface  to  a  set  of  application 
programs  such  that  the  user  can  effectively  utilize  the  resources  of  the  computer  to  solve  a  particular  problem. 
This virtual computer has, as its goal, a presentation of the resources of the underlying real computer in a
form such that they can be efficiently and correctly used.

In  a  uniprocessor-based  system,  the  management  of  resources  can  be  performed  taking  into  account  the 
complete  state  of  the  system,  since  it  is  generally  assumed  that  the  state  is  entirely  available  at  a  single 
location.  Algorithms  for  the  management  of  resources  in  such  systems  have  been  extensively  investigated, 
and  the  solutions  are  embodied  in  many  existing  systems. 

In the case of a distributed system, the resources are also distributed, and the knowledge of the system state
is not available at any one place. In fact, the very concept of an instantaneous state of the resources is
essentially meaningless, since such a state is indeterminable at any individual node of an asynchronous
distributed system. The reason for this uncertainty is twofold:

1.  There  is  a  non-negligible  and  unpredictable  delay  in  the  transfer  of  information,  including 
resource  allocation  state  information,  from  one  node  (processor)  of  a  distributed  system  to 
another.  In  particular,  this  delay  is  large  in  comparison  to  local  node  memory  access,  so  that  the 
transmission  delays  are  large  in  comparison  with  the  rate  at  which  the  system  state  changes. 

2. The possibility of an internode communications transmission error is, while arbitrarily reducible,
non-negligible, so that steps must be taken to detect and correct errors during communications.

Due to the presence of such errors, it is quite possible that the communications medium will
introduce uncertainties with respect to message origination time and sequencing. Attacks on the
problem of error handling principally result in the introduction of further (unbounded) delay, as
the communication system corrects transmission errors.

The resultant impossibility of ascertaining the global system state of a distributed system clearly renders
impossible the globally optimal allocation of system resources to user processes. Even though global
optimality cannot be achieved, however, we can, and will, define optimality with respect to whatever
information is available. It is this decision making process, then, which we will call best effort decision making
[Jensen 84].

1.1 Best Effort Decision Making

Best effort decision making cannot be rigorously defined, but we can discuss the concept in some detail.
Essentially, it refers to techniques for decision making utilizing incomplete and inaccurate information in the
best possible way. Techniques for determining and maximizing the value of the available information,
algorithms for cooperative internode decision making, and heuristics for decision search space reduction are
among the components of best effort decision making.




It is our thesis that because of the impossibility¹ of a globally optimal solution to decentralized resource
allocation,  resource  management  decisions  should  be  made  using  the  best  information  available,  making  a 
best  effort  to  determine  its  meaning.  We  believe  there  are  several  potential  techniques  (see  Section  2.5.1) 
which  can  be  used  to  evaluate  information  in  an  effort  to  arrive  at  a  decision  once  the  maximum  value  has 
been  extracted  from  the  available  information.  The  goal  of  best  effort  decision  making,  in  this  context,  is  to 
make  resource  allocation  decisions  as  efficiently  as  possible  in  the  presence  of  incomplete  and  inaccurate  state 
information. 

A best effort decision, with its prestated globally sub-optimal nature, clearly can be implemented using any
of a number of diverse approaches. The approaches to be considered in this research will cover a spectrum
including, but not necessarily limited to, applied AI, decision theory, and non-monotonic logic.

1.2  Real-Time  Deadline  Management 

Such a study will be of limited value, however, in a vacuum, so we will study a particular problem, and
several possible best effort decision making approaches will be applied to solve it, accompanied by evaluations.
The particular problem to be studied will be that of deadline scheduling in a real-time distributed system.

An operating system managing a real-time application shares a large number of functions with a normal
timesharing operating system, but differs most significantly in one principal area: the management of time
deadlines. The typical time sharing operating system manages a number of independent applications,
promoting some concept of fairness (defined by the system administrator) among them. In the real-time
system, a single application² uses the system resources to solve a single problem, and includes a set of
deadlines which must be met in order to satisfy its specifications. We distinguish hard real-time systems from soft
real-time systems in this document; the use of the word hard implies that a missed deadline corresponds to a
major system failure, while a missed deadline in a soft real-time system is not necessarily a major failure,
depending on the nature of the deadline, although a specification has been violated. This research will use a
computational model encompassing both soft and hard real-time systems.

The research which we undertake is to study deadline management in a distributed resource sharing
environment in which the system state is incomplete and inaccurate, and in which the time allocation decisions
must be made using a best effort approach. In fact, for this problem a best effort approach must be used
because of the built-in uncertainties even if complete and accurate global system state information were
available, because of the stochastic nature of the information available about schedulable processes and the
limited decision making time available in a real-time system.

¹ Intractability, even if global state information were available; see Section 3.4.

² A single application means a set of processes working together and sharing resources toward a common goal.
While it is possible for more than one such application to run concurrently in a real-time system, it is usually
true that the real-time [remainder of footnote illegible in source]




A considerable amount of research has been done on cases in which deadlines can be met, but relatively little
information is available about scheduling decisions when available resource limitations require that one or more
deadlines cannot be met. Our approach will be designed to maximize the value of the available state
information to make the deadline scheduling decisions, particularly in those cases where deadlines cannot be
met.

Questions  of  policy/mechanism  separation  [Wulf  81]  are  also  related  to  this  problem.  If  one  or  more 
deadlines  must  be  missed,  which  deadlines  should  be  selected?  Clearly  a  policy  decision  must  be  made,  and 
the  range  of  potential  policies  for  this  decision  will  be  determined,  with  the  required  mechanisms  defined, 
implemented,  and  evaluated.  It  is  these  mechanisms  which  will  reflect  the  best  effort  approach,  and  their 
support  of  the  potential  policies  will  be  analyzed. 

2  Motivation 

While it is obvious that resource management is the fundamental problem in the development of any
operating system, it is equally obvious that this is especially true in a distributed system [Applewhite 82].

The  difference  between  best  effort  and  algorithmic  (either  consensus  or  autonomous)  decision  making  in 
operating  system  resource  management  can  be  illustrated  with  decisions  on  the  use  of  resources  in  a  one- 
person  business  versus  a  large  corporation.  In  the  small  business,  resources  such  as  capital  and  raw  materials 
are  all  under  the  control  of  a  single  individual,  and  their  current  state  is  generally  knowable  and  therefore 
manageable  by  that  one  individual  in  an  optimum  manner.  In  the  large  corporation,  attempts  to  completely 
centralize the resource allocation decisions are generally disastrous. It is well known, however, that the larger
business  can,  at  least  in  principle,  make  more  efficient  use  of  its  resources  in  spite  of  its  seeming  disadvantage 
with  respect  to  the  optimality  of  its  resource  allocation.  Similarly,  if  we  can  properly  perform  best  effort 
resource  allocation  in  a  distributed  system,  the  overall  utilization  of  the  system’s  resources  should  be  more 
effective  than  in  a  similar  set  of  uniprocessors. 

The question naturally arises: why do almost all existing distributed systems consist of a set of autonomous
systems containing communicating processes, in which resources are managed by each node locally for the
benefit of its own processes, rather than as a single computational entity, utilizing all resources in the best
interests of the system? While there are many answers to this question, the difficulty and importance of
decision making is certainly one of the significant reasons. This is exactly the motivation for this research; it is
our thesis that, as in the case of the large business, a distributed computer which makes best effort decisions in
allocating all system resources for the benefit of the system will be better able to apply those resources to
fulfill the user's objectives.




3  Related  Research 

This  area  of  research  is  the  combination  and  extension  of  a  number  of  previous  efforts.  As  a  result,  there 
are several areas of research which can be considered related to this work. Here, we will describe research in
five  related  areas: 

•  Artificial Intelligence: particularly techniques for knowledge representation and rules of
inference related to decision making (planning).

•  Non-monotonic Logic: studies related to the handling of multi-valued logic, fuzzy logic (truth
assertions with respect to fuzzy sets), and associated truth determination.

•  Decision Theory: work related to the process of combining sets of observed states of nature, the
probabilities associated with these states in the event of a given set of decisions, and the ordering
of possible decisions based on this information (e.g., Bayesian theory).

•  Real-Time Deadline Management: research on deadline scheduling in a real-time system.

•  Distributed Decision Making: efforts toward the problem of reaching a consensus in a distributed
system in the presence of failures.

3.1  Artificial  Intelligence 

Little or no work has been performed in the application of knowledge collection, knowledge representation,
and the resulting inferences to the problem of scheduling, load balancing, or process reconfiguration. It
seems that these areas could be critical to the efficient handling of these decisions. Stefik [Stefik 82]
delineates a number of approaches to problems involving expert systems, outlining the primary techniques
used in developing them based on the overall system type (e.g., planning, diagnostic, etc.). This problem
shares a number of the aspects of planning problems, in that the solution must evaluate the data, determining
its quality and meaning, search the space of possible scheduling decisions, estimating the result of each
evaluated decision, and make a decision which is adequately close to optimal. It is expected that at least some
of the techniques involved in the design of expert planning systems will be applicable to the problem of
applying a best effort to the real-time process scheduling problem, particularly in the areas of data
understanding and knowledge and inference rule representation.

3.2 Non-monotonic Logic

Decision making is dependent on a determination of the state of the relevant world as assessed by the
decision maker. The decision to be made can be viewed as a consequence (deduction) from this perceived
state, which can be represented as a set of predicates. Normal formal logic, from which inference rules would
be taken to produce such deductions, can be described as monotonic in the sense that the addition of any new
axiom (i.e., an additional observation of state) which is not in conflict with existing axioms can never
invalidate a previously drawn conclusion. In a situation characterized by a large quantity of incomplete or
inaccurate information, such deductions may later be invalidated by the presence of new information. Logic
in which inferences may be drawn using such data is called non-monotonic logic [McDermott 80][McDermott
82], and is generally characterized by the use of a logical operator meaning "is consistent with" on conclusions
drawn.

In another form of non-monotonic logic, inferences which can be described only as "consistent" with perceived
state information can be handled by the use of fuzzy sets. Normal set theory deals with sets whose boundaries
are clear-cut; an element is either a member of a given set or it is not. Members of fuzzy sets [Kaufmann 75]
have an associated membership value which determines the degree of membership of that member in the set.
Fuzzy logic refers to the manipulation of predicates with fuzzy implications (e.g., high value processes should
be executed early). Discussions of such logic inference rules are given in [Zadeh 79].
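
As a small illustration of membership values, the sketch below grades how strongly a process belongs to the fuzzy set "high value" and combines grades with the min, max, and complement operators commonly used for fuzzy sets. The linear membership function, its 0 to 100 value range, and all function names are assumptions chosen for the example; they do not come from this report.

```python
def high_value_membership(value: float, low: float = 0.0, high: float = 100.0) -> float:
    """Degree (0.0 to 1.0) to which a process value belongs to the fuzzy set 'high value'.

    Assumed shape: 0 at or below `low`, 1 at or above `high`, linear in between.
    """
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def fuzzy_and(a: float, b: float) -> float:
    """Membership in the intersection of two fuzzy sets (minimum operator)."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """Membership in the union of two fuzzy sets (maximum operator)."""
    return max(a, b)


def fuzzy_not(a: float) -> float:
    """Membership in the complement of a fuzzy set."""
    return 1.0 - a


def order_by_high_value(process_values: dict) -> list:
    """Apply the fuzzy rule 'high value processes should be executed early'."""
    return sorted(process_values,
                  key=lambda name: high_value_membership(process_values[name]),
                  reverse=True)


execution_order = order_by_high_value({"logger": 10.0, "tracker": 90.0, "display": 50.0})
```

Unlike a crisp set, a process with value 50 is neither in nor out of "high value"; it is a member to degree 0.5.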

3.3  Decision  Theory 

There  are  several  mathematical  techniques  developed  to  construct  decisions  based  on  the  opinions  (inexact 
measurements) of a number of co-decision makers. Stankovic [Stankovic 83] describes a heuristic for job
assignments in a decentralized, distributed environment using Bayesian decision theory [Jeffrey 84], while
DeGroot describes iterative solutions given a matrix of opinion weights and a vector of opinions [DeGroot
74].
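
The iterative scheme attributed to DeGroot can be sketched as repeated weighted averaging: each decision maker replaces its opinion with a weighted mean of everyone's current opinions. The code below is a minimal sketch under the assumption that each row of the weight matrix is non-negative and sums to 1; the function names, tolerance, and sample numbers are illustrative only.

```python
def degroot_step(weights: list, opinions: list) -> list:
    """One round: decision maker i adopts the weighted average sum_j weights[i][j] * opinions[j]."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def degroot_consensus(weights: list, opinions: list,
                      tol: float = 1e-9, max_iters: int = 10_000) -> list:
    """Iterate until opinions stop changing by more than `tol` (or max_iters is reached)."""
    for _ in range(max_iters):
        updated = degroot_step(weights, opinions)
        if max(abs(a - b) for a, b in zip(updated, opinions)) < tol:
            return updated
        opinions = updated
    return opinions


# Two decision makers: the first weighs both opinions equally,
# the second trusts its own opinion three times as much as the other's.
weights = [[0.50, 0.50],
           [0.25, 0.75]]
consensus = degroot_consensus(weights, [0.0, 1.0])
```

Whether the iteration converges to a single shared value depends on the structure of the weight matrix; for the strictly positive matrix above it does.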

Bayesian decision theory may be used in an environment in which a decision must be made whose value is
determinable only when evaluated in the light of an unpredictable future global state (such as determining
whether to plan a picnic for Saturday). The decision maker first constructs a matrix with each row for a
potential decision and each column for a potential relevant state, placing in each matrix element a value of the
utility or desirability of having made that decision if the corresponding state occurred. Then the decision
maker constructs a similar matrix containing the probability that that state will actually occur (possibly
conditional on the decision to be made). The row-wise sum of the matrix produced by multiplying each
element of the utility matrix by the corresponding element of the probability matrix defines a vector
representing the optimal ordering of the set of potential decisions.
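
The matrix procedure just described can be written out directly. The following is a minimal sketch; the picnic utilities and rain probabilities are invented for illustration, and the function names are hypothetical.

```python
def expected_utilities(utility: list, probability: list) -> list:
    """Row-wise sum of the element-by-element product of the two matrices.

    utility[d][s]     : desirability of decision d if state s occurs
    probability[d][s] : probability of state s (possibly conditional on decision d)
    """
    return [sum(u * p for u, p in zip(u_row, p_row))
            for u_row, p_row in zip(utility, probability)]


def rank_decisions(names: list, utility: list, probability: list) -> list:
    """Order decisions from highest to lowest expected utility."""
    scores = expected_utilities(utility, probability)
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Rows: decisions. Columns: states (sunny, rain). All numbers are made up.
decisions = ["plan picnic", "stay home"]
utility = [[10.0, -5.0],
           [2.0, 4.0]]
probability = [[0.7, 0.3],
               [0.7, 0.3]]  # the weather does not depend on the decision here
ranking = rank_decisions(decisions, utility, probability)
```

With these numbers the picnic has the higher expected utility (about 5.5 against 2.6), so it is ranked first; a larger rain probability would eventually reverse the order.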

3.4  Real-Time  Deadline  Management 

Research into the general scheduling problem has progressed for a long time as a part of operations
research, where it has found application in the scheduling of such activities as manufacturing production.
Graves [Graves 81] has provided an excellent synopsis of this background, including a taxonomy of
production scheduling problems. He identifies four levels of complexity with respect to scheduling decisions;
tractable solutions have been identified for only the simplest of these levels while the rest have been proved to
be NP-complete or NP-hard.

In particular, the level of complexity most similar to the problem to be considered in this research, soft
real-time deadline scheduling with constant value functions for met deadlines, has been proved to be
NP-complete [Karp 72]. Algorithms for this problem have been described [Sahni 76] both for optimum
scheduling (obviously, in exponential time) and for a heuristic approach for which an optimal solution can be missed
by a determinable factor in polynomial time. The model which we are using to define the scheduling problem to
be handled includes this problem as a special case.

In  a  classic  paper,  Liu  and  Layland  [Liu  73]  show  that  the  simple  deadline  scheduling  problem  with  a  set  of 
independent  periodic  processes  whose  computation  times  are  exactly  known  and  identical  for  every  period, 
whose  deadlines  are  equal  to  their  period,  and  which  are  running  in  a  single  processor,  is  solvable  with  a 
simple  algorithm,  and  that  the  schedulability  of  such  a  simple  system  is  easily  determinable  in  advance. 
While  this  result  is  interesting,  the  cases  covered  are  much  too  unrealistic  to  be  directly  applicable  to  our 
problem. 
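
For readers who want the advance schedulability check in concrete form, the sketch below uses the utilization bounds from the cited Liu and Layland paper rather than anything stated in this report: total utilization of at most n(2^(1/n) - 1) is sufficient for fixed-priority rate-monotonic scheduling, and utilization of at most 1 is necessary and sufficient for deadline-driven scheduling, under the assumptions listed above (independent periodic processes, known constant computation times, deadlines equal to periods, one processor). The function names and task sets are illustrative.

```python
def utilization(tasks: list) -> float:
    """Total processor utilization of (computation_time, period) pairs."""
    return sum(c / t for c, t in tasks)


def rate_monotonic_bound(n: int) -> float:
    """Liu and Layland least upper bound n * (2**(1/n) - 1) for n tasks."""
    return n * (2.0 ** (1.0 / n) - 1.0)


def schedulable_rate_monotonic(tasks: list) -> bool:
    """Sufficient (not necessary) test for fixed-priority rate-monotonic scheduling."""
    if not tasks:
        return True
    return utilization(tasks) <= rate_monotonic_bound(len(tasks))


def schedulable_deadline_driven(tasks: list) -> bool:
    """Necessary and sufficient test for deadline-driven (earliest deadline first) scheduling."""
    return utilization(tasks) <= 1.0


light_load = [(1, 4), (2, 5), (1, 10)]   # utilization 0.75
heavy_load = [(2, 4), (2, 5)]            # utilization 0.90
```

The `heavy_load` set fails the rate-monotonic bound for two tasks (about 0.83) yet passes the deadline-driven test, which shows that the first test is only sufficient.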

3.5  Distributed  Decision  Making 

The Byzantine Generals Problem [Lamport 82][Dolev 82][Lynch 82] is the name given to the problem of
reaching a consensus on a decision in an environment in which some of the decision makers have failed (or
may even be antagonistic to the making of a correct decision). It has been proved that in a fully connected
network of decision makers [Lamport 82], a correct decision can be effected as long as more than two thirds of
the decision makers have not failed. The algorithms presented, however, while guaranteed to produce a
correct solution, require full synchronization of all decision makers, and require a large number of messages.
It should be noted that in a completely asynchronous system (i.e., one with no complete event ordering such
as a clock), no Byzantine agreement is possible in the presence of any failure [Fischer 82].
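
The two-thirds condition can be restated arithmetically: with n decision makers of which m have failed, more than two thirds are sound exactly when n > 3m. The sketch below only checks that threshold for the synchronous, fully connected case discussed above; it does not implement an agreement algorithm, and the function names are illustrative.

```python
def agreement_possible(total: int, faulty: int) -> bool:
    """True when more than two thirds of the decision makers have not failed.

    (total - faulty) > 2 * total / 3 is equivalent to total > 3 * faulty.
    """
    return total > 3 * faulty


def max_tolerable_faults(total: int) -> int:
    """Largest number of failed decision makers that still satisfies total > 3 * faulty."""
    return max((total - 1) // 3, 0)


# Four decision makers can tolerate one failure; three cannot.
four_ok = agreement_possible(4, 1)
three_ok = agreement_possible(3, 1)
```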




References 

[Allchin 82]  Allchin, James E. and Martin S. McKendry.

Object-Based Synchronization and Recovery.

Technical Report GIT-ICS-82/15, School of Information and Computer Science, Georgia
Institute of Technology, 1982.

[Attar 84]  Attar, R., Bernstein, P. A. and Goodman, N.

Site Initialization, Recovery and Backup in a Distributed Database System.

IEEE Transactions on Software Engineering, Nov., 1984.

[Bentley 83]  Bentley, J. L., Johnson, D. S., Leighton, T. and McGeoch, C. C.

An Experimental Study of Bin Packing.

Technical Report, Bell Laboratories, Murray Hill, N.J. 07974, 1983.

[Bernstein 83]  Bernstein, P. A., Goodman, N. and Hadzilacos, V.

Recovery Algorithms for Database Systems.

Technical Report, Aiken Computation Laboratory, Harvard University, March 1983.

[Chu 80]  Chu, W. W.; Holloway, L. J.; Lan, M.; Efe, K.

Task Allocation in Distributed Data Processing.

Computer 13(11):57-69, November, 1980.

[Chvatal 83]  Chvatal, V.

Linear Programming.

W. H. Freeman and Company, 1983.

[Coffman 83]  Coffman Jr., E. G., Garey, M. R. and Johnson, D. S.

Approximation Algorithms for Bin Packing - An Updated Survey.

Technical Report, Bell Laboratories, Murray Hill, N.J., 1983.

[DeGroot 74]  DeGroot, M. H.

Reaching a Consensus.

Journal of the American Statistical Association 69(345):118-121, March, 1974.

[Dolev 82]  Dolev, D.

The Byzantine Generals Strike Again.

Journal of Algorithms 3(1):14-30, March, 1982.

[Eswaran 76]  Eswaran, K. P., J. N. Gray, R. A. Lorie and I. L. Traiger.

The Notions of Consistency and Predicate Locks in a Database System.

CACM, 1976.

[Fischer 82]  Fischer, M. J.; Lynch, N. A.; Paterson, M. S.

Impossibility of Distributed Consensus with One Faulty Process.

Technical Report, Massachusetts Institute of Technology, September, 1982.

[Frederickson 80]  Frederickson, G. N.

Probabilistic Analysis for Simple One and Two Dimensional Bin Packing Algorithms.

Information Processing Letters, Vol. 11, No. 4, 5, Dec. 1980.

[Garcia-Molina 83]  Garcia-Molina, H.

Using Semantic Knowledge For Transaction Processing In A Distributed Database.

ACM Transactions on Database Systems, Vol. 8, No. 2, June, 1983.


[Graves 70]  Graves, G. W. and Whinston, A. B.

An Algorithm for the Quadratic Assignment Problem.

Management Science, Vol. 17, No. 7, Mar. 1970.

[Graves 81]  Graves, S. C.

A Review of Production Scheduling.

Operations Research 29(4):646-675, July-August, 1981.

[Hillier 80]  Hillier, F. S. and Lieberman, G. J.

An Introduction to Operations Research - 3rd Edition.

Holden-Day Inc., 1980.

[Jeffrey 84]  Jeffrey, R. C.

The Logic of Decision, Second Edition.

University of Chicago Press, Chicago and London, 1984.

[Jensen 83]  Jensen, E. D.

The Archons Project: An Overview.

In Proceedings of the International Symposium on Synchronization, Control, and
Communication in Distributed Systems, pages 31-35. Academic Press, 1983.

[Jensen 84]  Jensen, E. D.

ArchOS: A Physically Dispersed Operating System.

IEEE Distributed Processing Technical Committee Newsletter, June, 1984.

[Kaku 83]  Kaku, B. K. and Thompson, G. L.

An Exact Algorithm for The General Quadratic Assignment Problem.

Technical Report, Graduate School of Industrial Administration, Carnegie-Mellon
University, Mar. 1983.

[Karp 72]  Karp, R. M.

Reducibility among Combinatorial Problems.

Complexity of Computer Computations.

Plenum Press, New York, 1972, pages 397-411.

[Kaufmann 75]  Kaufmann, A.

Introduction to the Theory of Fuzzy Sets.

Academic Press, New York, 1975.

[Kleinrock 76]  Kleinrock, L.

Queueing Systems Vol II, pp 424 - 425.

John Wiley and Sons, 1976.

[Korth 83]  Korth, H. F.

Locking Primitives in a Database System.

JACM, Vol. 30, No. 1, January, 1983.

[Kung 79]  Kung, H. T. and C. H. Papadimitriou.

An Optimal Theory of Concurrency Control for Databases.

In Proceedings of the SIGMOD International Conference on Management of Data, pages
116-126. ACM, 1979.

[Lamport 82]  Lamport, L.; Shostak, R.; Pease, M.

The Byzantine Generals Problem.

ACM Transactions on Programming Languages and Systems 4(3):382-401, July, 1982.




[Lesser 73]  Erman, L. D., Fennell, R. D., Lesser, V. R., and Reddy, D. R.

System Organizations for Speech Understanding.

The Third International Joint Conference on Artificial Intelligence, August.

[Liskov 85]  Liskov, B. and Weihl, W.

Specification of Distributed Programs.

Technical Report, M.I.T., Sept. 1985.

[Liu 73]  Liu, C. L.; Layland, J. W.

Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment.

Journal of the Association for Computing Machinery 20(1):46-61, January, 1973.

[Lynch 82]  Lynch, N. A.; Fischer, M. J.; Fowler, R. J.

A Simple and Efficient Byzantine Generals Algorithm.

In Proceedings of the Second Symposium on Reliability in Distributed Software and Database
Systems, pages 46-52. ACM-IEEE, July 19-21, 1982.

[Lynch 83]  Lynch, N. A.

Multi-level Atomicity - A New Correctness Criterion for Database Concurrency Control.

ACM Transactions on Database Systems, Vol. 8, No. 4, December, 1983.

[Marschak 72]  Marschak, J. and Radner, R.

Economic Theory of Teams.

Yale University Press, 1972.

[McDermott 80]  McDermott, D.; Doyle, J.

Non-Monotonic Logic I.

Artificial Intelligence 13:41-72, 1980.

[McDermott 82]  McDermott, D.

Non-Monotonic Logic II.

Journal of the Association for Computing Machinery 29(1):33-57, January, 1982.

[Mohan 83]  Mohan, C.

Efficient Commit Protocol for The Tree Process Model of Distributed Transactions.

IBM Research Report, RJ 3881 (44078), Computer Science, June, 1983.

[Mohan 85]  Mohan, C., Fussell, D., Kedem, Z. M. and Silberschatz, A.

Lock Conversion in Non-Two-Phase Locking Protocols.

IEEE Transactions on Software Engineering, Jan., 1985.

[Papadimitriou 84]  Papadimitriou, C. H. and Kanellakis, P. C.

On Concurrency Control by Multiple Versions.

ACM Transactions on Database Systems, Mar., 1984.

[Sahni 76]  Sahni, S. K.

Algorithms for Scheduling Independent Tasks.

Journal of the Association for Computing Machinery 23(1):116-127, January, 1976.

[Schwarz 84a]  Schwarz, Peter M. and Alfred Z. Spector.

Synchronizing Shared Abstract Types.

ACM Transactions on Computer Systems, Aug. 1984.

[Schwarz 84b]  Schwarz, P.

Transactions on Typed Objects.

PhD thesis, Department of Computer Science, Carnegie-Mellon University, 1984.




[Sha 85a]  Sha, L.

Modular Concurrency Control and Failure Recovery - Consistency, Correctness and
Optimality.

PhD thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon
University, 1985.

[Sha 85b]  Sha, L., Lehoczky, J. P. and Jensen, E. D.

Modular Concurrency Control and Failure Recovery - Part II: Failure Recovery.

Submitted for publication, 1985.

[Sha 85c]  Sha, L., Lehoczky, J. P. and Jensen, E. D.

Modular Concurrency Control and Failure Recovery - Part I: Concurrency Control.

Submitted for publication, 1985.

[Silberschatz 80]  Silberschatz, A., and Z. Kedem.

Consistency in Hierarchical Database Systems.

JACM 27:1, 1980.

[Stankovic 83]  Stankovic, J. A.

Bayesian Decision Theory and Its Application to Decentralized Control of Job Scheduling.

February, 1983.

Internal documentation, University of Massachusetts, Department of Electrical and
Computer Engineering.

[Stefik 82]  Stefik, M.; Aikins, J.; Balzer, R.; Benoit, J.; Birnbaum, L.; Hayes-Roth, F.; Sacerdoti, E.

The Organization of Expert Systems, A Tutorial.

Artificial Intelligence 18:135-173, 1982.

[Svobodova 84]  Svobodova, L.

Resilient Distributed Computing.

IEEE Transactions on Software Engineering, May, 1984.

[Tokuda 85]  Tokuda, H., Clark, R. K. and Locke, C. D.

Archons Operating System (ArchOS) - Client Interface, Part II.

Technical Report, Department of Computer Science, Carnegie-Mellon University, 1985.

[Weihl 84]  Weihl, W. E.

Specification and Implementation of Atomic Data Types.

PhD thesis, Massachusetts Institute of Technology, 1984.

[Wulf 81]  Wulf, W. A.; Levin, R.; Harbison, S. P.

HYDRA/C.mmp: An Experimental Computer System.

McGraw-Hill, Inc., 1981.

[Zadeh 79]  Zadeh, L. A.

A Theory of Approximate Reasoning.

Machine Intelligence 9.

Wiley, New York, 1979.




4.2  Best  Effort  Decision  Making  Theory 





1  Introduction

In a decentralized computer system, processes often function co-operatively as a team to enhance system
level performance. One basic goal of the Archons project, of which this research is part, is to develop concepts
and mechanisms which are intended to function co-operatively by using co-operative decision making. These
mechanisms must function efficiently even though they will have both inaccurate and incomplete information
upon which to act [Jensen84]. In this chapter, we introduce a framework for dealing with co-operative
decision making problems arising in decentralized systems in general and the Archons system specifically.
Concepts drawn from queueing theory, team decision theory [Marschak72] and information theory will be
used to analyze the value of system status information for resource management. We focus specifically on the
trade-offs between the completeness and the timeliness of information. Rather than treating the consensus
problem abstractly, we will single out a particular problem for study: the problem of decentralized load
balancing. This problem will nicely illustrate the application of one particular paradigm of consensus
decision making - team decision theory. Furthermore, it illustrates the trade-offs among the cost of information
(loading information at each processor), its completeness, and its timeliness. This chapter is organized as
follows. Section 2 provides an overview of decentralized decision making. Section 3 describes the load
balancing problem and its formulation as a consensus decision making problem. Specific results
characterizing the value of system status information are derived. Section 4 presents some ideas for further study.

2  Overview  of  Co-operative  Decision  Making 

In decentralized computer systems, the processors often must collaborate to solve a particular system or
application problem. The problems may range from those associated with the management of system
resources (such as deciding which processor should process a particular task) to those arising in a specific
application such as speech recognition. We propose to try to deal with these problems in a general way by
casting them in a framework of statistical decision theory. Once in such a framework, it is possible to consider
issues such as the importance or value of information and its decline in value with time delays.

In statistical decision theory, there is a set of possible states, one of which is the "true" state. Generally, it is
desired to identify the particular true state. Often data are available to assist the decision maker in identifying
the true state. These data have different likelihoods for the various possible states. An example might be the
load balancing problem in a distributed local computer network. The possible states represent the exact
workloads at each of the processors at the time the decision is made. The data are the workload information
messages exchanged by the processors. These messages are inexact and delayed in time. The particular
decision made will vary in goodness with the possible states of the system.

Many decision problems arising in decentralized decision making can be cast in a statistical decision theory
framework. Once in this framework, various solution techniques can be brought to bear on the problems. A
particularly powerful method, Bayesian decision theory, should be considered, since it leads to optimal
decision procedures. In the Bayesian formulation, a probability distribution is specified on the possible
states. This is called a prior distribution. Next, we must determine the probability of the observed data under
each of the possible states, which is called the likelihood function. These two components result in a posterior
distribution over the possible states. The posterior distribution reflects both the prior information and the
observed data. An optimal decision or action with respect to some criterion can then be determined.
Furthermore, the posterior distribution can be continuously updated in real-time. Unfortunately, this program
can often not be carried out in its exact form. Two fundamental difficulties arise. First, the prior distribution
and likelihood function may be essentially impossible to specify. Second, in many examples the Bayesian
formulation does not consider that data (for example processor status information) deteriorate in value with
time. If some information is delayed, the quality of the decision made may be greatly reduced.
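
The prior, likelihood and posterior steps described above can be illustrated with a small discrete example. The sketch below is only an illustration: the three load states, the prior and the likelihood table are invented numbers, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood, observation):
    """Return P(state | observation) for a discrete state space.

    prior:      dict state -> P(state)
    likelihood: dict state -> dict observation -> P(observation | state)
    """
    unnormalized = {s: prior[s] * likelihood[s][observation] for s in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every state")
    return {s: w / total for s, w in unnormalized.items()}


# Hypothetical example: the true state is the load of a remote processor,
# and the data are inexact status messages reporting that load.
prior = {"light": 0.5, "medium": 0.3, "heavy": 0.2}
likelihood = {
    "light":  {"reports_light": 0.8, "reports_heavy": 0.2},
    "medium": {"reports_light": 0.5, "reports_heavy": 0.5},
    "heavy":  {"reports_light": 0.1, "reports_heavy": 0.9},
}

first = posterior(prior, likelihood, "reports_heavy")
# The posterior serves as the prior for the next message (sequential update).
second = posterior(first, likelihood, "reports_heavy")
best_guess = max(second, key=second.get)
```

Note that nothing in this update accounts for the age of a message, which is the second difficulty raised above.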

In many real problems, a joint likelihood function must be determined empirically. When a high dimension
likelihood function is required, an empirical approach is impossible to carry out. For example, in a
speech recognition context, the likelihood of a particular signal being a particular phrase will depend on many
factors ranging from its context in the sentence to the sound of the actual signal. A speech recognition system
will take these many diverse factors into account. The Bayesian program would call for these factors to be
considered jointly rather than marginally (separately). Currently, this is an impossible task. A satisfactory but
sub-optimal approach is to employ a team of specialized processors. Each processor works on a special aspect
of the problem. Each processor possesses the special database and codes designed for its particular aspect of
the problem. The individual processors are to function as a team of experts and are to arrive at a consensus
decision. Each processor develops hypotheses and probabilities associated with the hypotheses. The team
members exchange these hypotheses and by this process enhance the marginal viewpoints to a more global
view. The process must somehow converge to a consensus decision of the team, perhaps by a formal method
such as that of DeGroot [DeGroot74] or by some heuristic method such as the HearSay system [Lesser73].
The approach is somewhat similar in spirit to the "divide and conquer" approach of algorithm design but
lacks the optimality of the full Bayesian program, because the joint information has been sacrificed. That is,
the inter-dependence of various aspects is only approximated by the indirect approach of reaching a
consensus among experts. In the following discussion, we refer to this type of co-operative decision making process
as consensus decision making.
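
One formal route to consensus is iterated weighted averaging in the spirit of DeGroot's model, in which each team member repeatedly replaces its estimate with a weighted average of all members' estimates. The sketch below assumes a fixed weight matrix whose rows are probability vectors; the function names, weights and initial estimates are illustrative and are not taken from this report.

```python
def degroot_step(weights, opinions):
    """One round: member i adopts the weights[i]-weighted average of all opinions."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def degroot_consensus(weights, opinions, tol=1e-9, max_rounds=10_000):
    """Iterate until all opinions agree to within tol; return (opinions, rounds)."""
    for row in weights:
        if any(w < 0 for w in row) or abs(sum(row) - 1.0) > 1e-12:
            raise ValueError("each row of weights must be a probability vector")
    for rounds in range(max_rounds):
        if max(opinions) - min(opinions) <= tol:
            return opinions, rounds
        opinions = degroot_step(weights, opinions)
    raise RuntimeError("no consensus within max_rounds")


# Three experts with initial probability estimates for one hypothesis.
trust = [
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
estimates, rounds_used = degroot_consensus(trust, [0.9, 0.5, 0.1])
```

With these particular weights the three estimates settle near 0.5. Each round requires a full exchange of opinions, which is the communication cost taken up in the next paragraph.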

A second consideration which often leads to decentralized decision making is the delay in and the cost of
interprocess communication. In many real-time applications such as load balancing, information delay often
plays a dominant role, and rapid decision making may be crucial. The time delay incurred in collecting
complete status information from all the relevant parties could be sufficiently long that by the time all the
information is gathered and a decision is made, valuable time will have been lost, and the information itself
will be substantially in error. To obtain a timely decision, the decision makers may be forced to make
decisions based only on partial information. For example, in a large point-to-point network, a dynamic
routing scheme is needed for message transfers through the network. Clearly, elaborate information gathering
and consensus arbitration is self defeating. Instead, one might adopt decentralized procedures such as the
one used in Arpanet [Kleinrock76] and allow each node to make its own local routing decisions. Furthermore,
each node will make these decisions based on only partial information, the information from nearby
nodes. Information from more distant nodes will be significantly delayed and of lesser value. This illustration
highlights the importance of the two concepts of timeliness and completeness of information. Since this
type of co-operative decision making requires decision makers to work as a team to further the system level
performance but does not require them to reach an agreed upon opinion, we will refer to this type of
co-operative decision making as team decision making [Marschak72]. A single team member is allowed to
make a certain set of decisions without necessarily gaining the concurrence of the other team members. Any
team member may seek information which will improve the quality of its decisions. All team members make
decisions which attempt to improve overall system performance.

The consensus decision making procedure, in which a single decision is reached, and the team decision
procedure can be thought of as extremes on a negotiation spectrum. Generally, there are several important
facets in any consensus decision making problem which will, in part, determine the most appropriate solution
scheme. These include:

•  Is it necessary to reach a single consensus among all decision makers? Can a single decision maker
make the decision after taking the opinions and information of others into account?

•  How important is the completeness of the information in making an optimal or near optimal
decision? That is, how does the quality of a decision deteriorate as the completeness of
information decreases?

•  What is the computational complexity of the proposed decision process? Does a substantial
reduction in complexity cause a major or minor reduction in performance?

•  What is the total delay of a proposed decision process? This includes the information gathering,
negotiation, and decision making. More importantly, how does this total delay compare with the
relaxation time of the system? The relaxation time is a measure of how fast the system is changing
its status.

In the following, we attempt to answer some of these questions in the context of load balancing.




3  The  Load  Balancing  Problem 

In this section, we consider the load balancing problem in the context of a local distributed computer system
such as Archons [Jensen83] in which users create jobs at the nodes they log into. Since both the job arrival
process and the required computational time for each node are unpredictable, the loads at each of the nodes
will be widely variable. Some will be lightly loaded, while others will be heavily loaded. The task of the
decision makers (local operating systems) is to equalize the loads in the system in order to improve the system
wide performance. Balanced loads will create short queues and short response times.

Each decision maker can manage its own job queue if it chooses to process it locally. Load balancing
necessitates the shifting of load from one processor to another, and this necessitates some form of consensus
and negotiation involving at least a subset of the processors. At the other end of the spectrum, one processor
acting as a central scheduler will assign the job to one of the processors. The no negotiation approach can be
carried further by letting each of the processors schedule its own jobs. This arrangement would then be
equivalent to the team decision approach whereby no consensus is needed, but the individual decision makers
act to improve overall system performance. There is no a priori necessity for consensus. Rather, the degree
of negotiation must be dictated by performance.

Generally, there are two aspects of load balancing. The first aspect is the matching of the structure of the
data flow in the application with the architecture of the distributed computer system. This includes clustering
closely coupled processes in the same or nearby nodes, scheduling tasks with respect to precedence relations
in such a way that more tasks can be executed concurrently, and matching the special resource requirements
of tasks with special hardware, etc. From a modelling point of view, the optimization of these aspects is
typically an integer programming problem. In many dedicated real-time control systems, this is done off-line
due to the lack of efficient algorithms for solving integer programming problems. A second aspect of load
balancing is the control of system dynamics. In a distributed system, there is substantial redundancy built into
the system. Often a class of tasks can be executed with near equal efficiency by any of a group of similar or
identical processors. The problem then is the equalization of the load so as to improve the throughput and
response times. In this chapter, we will concentrate on this latter aspect.

Traditionally,  the  load  balancing  is  carried  out  by  a  centralized  scheduler.  However,  this  approach  is 
inadequate  for  many  larger  scale  systems  which  are  typically  loosely  coupled.  The  inadequacy  is  in  part  due 
to  reliability  considerations  and  in  part  due  to  the  time  delay  involved  in  routing  all  relevant  information  to  a 
centralized  location.  In  this  chapter,  we  will  investigate  a  highly  parallel  and  highly  decentralized  approach. 

In this approach, each processor is considered to be a member of a management team, the individual actions
of which are geared to increasing the system level performance. The crucial requirement is the development
of an information exchange structure and an associated decision procedure which makes such an approach
effective.




In summary, it appears that one reasonable approach to load balancing is the highly decentralized team
approach. In this approach, each processor acts individually to increase system performance. The key
ingredient is to develop an information exchange structure which will lead to increased system performance
by allowing each individual processor to schedule its own jobs. The most difficult aspect lies in obtaining the
relevant system status information in a timely fashion. In addition, the actions of the individual processors
must be sufficiently coordinated to generate a coherent global strategy, even though it is implemented in a
highly decentralized fashion. We present some results on this team formulation in the rest of this section.

3.1  The  Value  Of  Perfect  Information 

In developing a highly decentralized team approach to the load balancing problem, an important
consideration is the value of system status information. The actual information available to a processor will often be
delayed, incomplete, and inaccurate. It is, however, useful to consider the ideal case of "perfect" information.
System performance measures such as throughput or mean response time are influenced by unavoidable
queueing or congestion delays and delays caused by imperfect system status information. To quantify the
importance of system status information, we must separate out the extra delay incurred when information is
imperfect. The approach is to measure system performance under the assumption of perfect information and
to also measure it under some imperfect information scheme. The former gives an upper bound on
obtainable performance, while the difference measures the loss in performance due to imperfect information.

As an illustration, let us contrast two extreme cases. At one extreme, we assume that all processors have
complete, accurate, and instantaneous system information. At the other extreme, we assume that processors
have no system status information at all (not even local information). In both cases, the processors will make
the best possible decisions, but since different information levels are assumed, the performance achieved in
the two situations will be different. The difference in performance is due solely to the availability of the
status information and provides us with a quantitative measure of the value of complete information. This
measure will help us to identify situations in which information is very valuable and situations where it is not.

As a specific example of this approach, we assume that the network consists of $n$ homogeneous processors.
We assume jobs arrive at the $i$th processor according to a Poisson process with mean $\lambda_i$ jobs per unit
time. All jobs are homogeneous and require a random amount of time to process given by an exponential
distribution with mean $m$. We will, for the moment, ignore all other factors such as communication costs,
priorities, etc. In the case of perfect information, each processor is able to act as a central scheduler. If a
processor is available for work, this is assumed to be known to the other processors. No queues will form if
other processors are idle. This corresponds to an M/M/n queueing system with arrival rate
$\lambda = \lambda_1 + \cdots + \lambda_n$ and mean service rate $1/m$. The performance of such a system can be
explicitly characterized. For example, the mean response time is given by
$n\rho\,((1-\rho) + C(n,n\rho)/n)\,/\,(\lambda(1-\rho))$, where $C$ is the Erlang C function and
$\rho = \lambda m/n$ is the traffic intensity per processor. The Erlang C function is bounded above by 1 and is
usually small. Thus this formula can be approximated by $n\rho/\lambda$.
Other quantities such as processor utilization can also be calculated.
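
The mean response time above can be checked against the standard M/M/n result. The short derivation below assumes the usual definitions, with service rate $\mu = 1/m$ and offered load $n\rho = \lambda m$; these are standard queueing identities rather than material taken from this report.

```latex
\begin{align*}
W &= m + \frac{C(n, n\rho)}{n\mu - \lambda}
   = m + \frac{m\,C(n, n\rho)}{n(1-\rho)} \\
  &= \frac{m\left[(1-\rho) + C(n, n\rho)/n\right]}{1-\rho}
   = \frac{n\rho\left[(1-\rho) + C(n, n\rho)/n\right]}{\lambda(1-\rho)},
   \qquad \text{since } m = \frac{n\rho}{\lambda}.
\end{align*}
```

Dropping the Erlang C term leaves $W \approx m = n\rho/\lambda$, the approximation quoted above.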




In the case of no system status information, each processor must make a decision based only on the
$\lambda_i$, $m$, and $n$. That is, only long run average information rather than dynamic loading information is
available to the processors in this case. The optimal assignment scheme is probabilistic and results in an even
load being put on all processors. It is important to observe that the concept of a balanced load is defined here
only in the long run. All jobs are allocated so that in the long run the same number of jobs will be handled by
all $n$ processors. The lack of current and complete system status information means that the processors are not
able to take current loads into account. Imbalances will result, and system performance will be decreased.

The situation of no current status information corresponds to a collection of $n$ independent M/M/1
queueing systems with Poisson input having mean rate $\lambda/n$ and mean service time $m$. Performance
quantities are easily calculated for such a system. For example, the mean response time is given by
$n\rho/(\lambda(1-\rho))$. It is interesting to compare the mean response time in these two extreme cases, the
complete information case and the no information case. The two differ by the multiplicative factor
$1 - \rho + C(n,n\rho)/n$. One might refer to this quantity as the "information factor". The factor is equal to 1,
its maximum value, when $\rho = 0$, and it decreases to $1/n$ as $\rho$ increases to 1. This means that at low
traffic intensities ($\rho$ near 0), system status information provides no significant reduction in mean response
time. This factor can be calculated for any traffic intensity $\rho$ and shows that information is very valuable at
moderate and high intensities. Furthermore, the value of information increases with the number of
processors in the network.
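
The information factor is easy to evaluate numerically. The sketch below computes the Erlang C function from the standard Erlang B recursion and then forms the factor $1 - \rho + C(n,n\rho)/n$; the function names are hypothetical, and $\rho$ is taken to be the traffic intensity per processor.

```python
def erlang_c(n, a):
    """Erlang C: probability that an arrival must wait in an M/M/n queue.

    n: number of servers; a: offered load (arrival rate * mean service time), a < n.
    """
    if not 0 <= a < n:
        raise ValueError("need 0 <= offered load < number of servers")
    # Erlang B by the standard stable recursion, then convert to Erlang C.
    b = 1.0
    for k in range(1, n + 1):
        b = a * b / (k + a * b)
    rho = a / n
    return b / (1.0 - rho * (1.0 - b))


def information_factor(n, rho):
    """Ratio of perfect-information to no-information mean response time."""
    return 1.0 - rho + erlang_c(n, n * rho) / n


# Tabulate the factor for a few network sizes and traffic intensities.
table = {
    (n, rho): information_factor(n, rho)
    for n in (2, 5, 10)
    for rho in (0.1, 0.5, 0.9)
}
```

For example, `information_factor(2, 0.5)` evaluates to about 0.667, and for a single processor the factor is 1 at every intensity, since there is no other processor to learn about.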

3.2  The  Value  Of  Local  Information 

In this section, we continue to assume a situation in which processes and processors are homogeneous.
Furthermore, we do not consider the communication delays or the time cost of transferring tasks in this
section. Now we wish to study the situation in which each processor has only local information (the status of
its own job queue) and knows the long run arrival rate of work. The arrivals are assumed to be independent
Poisson processes with common parameter $\lambda$, while service times are again exponential with mean $m$. Each
processor upon receiving a job must determine whether to process it locally or send it to another processor.
This decision must be based only on the local queue length and is made without any negotiation. We also
assume that if it is sent to another processor, that processor must accept the job and cannot send it further.
This assumption is useful in that it prevents excessive job swapping, and it simplifies the analysis.

A general class of control policies within which an optimal policy must lie consists of a set of probabilities,
$\{p_n\}$, where $p_n$ gives the probability that the processor accepts a newly arrived task when its local queue
consists of $n$ other tasks (here $n$ denotes a queue length, not the number of processors). These probabilities
must be decreasing in $n$. Optimal stochastic control theory suggests that the optimal $p_n$'s will be
non-randomized (either 0 or 1). This reduces consideration to a 'control limit' policy: accept a task if and only if
the local queue, including the one being served, consists of $L$ or fewer tasks; otherwise send it to another
processor chosen from among the other candidates. The best value of $L$ will depend on the traffic intensity. If
the traffic intensity is small, then $L = 1$ is appropriate. For larger values of $\rho$, $L$ must be larger. It
remains to determine the optimum value of $L$ for each $\rho$, and to assess the
performance of the resulting local information system.
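
A control limit policy of this kind can be sketched in a few lines. The code below makes two assumptions that should be read as such: the count of "$L$ or fewer tasks" is taken to include the arriving task, so a task is kept when fewer than `limit` tasks are already present, and a rejected task goes to one of the other processors chosen uniformly at random. The function names are hypothetical.

```python
import random


def accepts_locally(tasks_present, limit):
    """Control limit rule: keep a new task only if fewer than `limit` tasks are present.

    tasks_present counts every task at this processor, including the one in service.
    """
    return tasks_present < limit


def place_task(origin, queue_lengths, limit, rng=random):
    """Return the index of the processor that will run a task arriving at `origin`.

    A rejected task goes to one of the other processors chosen uniformly at
    random; that processor must accept it, so a task moves at most once.
    """
    if accepts_locally(queue_lengths[origin], limit) or len(queue_lengths) == 1:
        return origin
    others = [i for i in range(len(queue_lengths)) if i != origin]
    return rng.choice(others)


# With limit 1, an idle processor keeps its task and a busy one ships it out.
kept = place_task(0, [0, 3, 2], limit=1)
shipped = place_task(0, [1, 3, 2], limit=1, rng=random.Random(0))
```

The decision uses only the local queue length, so no status messages are exchanged; the receiving processor's queue length is deliberately ignored.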




Once $L$ has been determined, there is a second phase to the policy - the choice of the processor to send the
unaccepted task. A number of policies, both deterministic and stochastic, can be used. An optimal policy
must equalize the load on the processors in the long run. Beyond that requirement, we expect that the
optimal policy will minimize the coefficient of variation of the resulting arrival processes. The form of the
optimal policy is still an open question. We choose a policy in which each processor selects randomly from
the remaining $(n - 1)$ available choices.


The $L$ policy queueing system has not been studied before, and it is quite complicated to produce exact
analytic solutions. Fortunately, a simple approximation can be used which is very accurate for large values of
$n(1-\rho)$. We treat each of the $n$ processors as an independent birth-death process with birth rate
$b_k = \lambda(1+s)$ for $k < L$ and $b_k = \lambda s$ for $k \ge L$, where $k$ is the number of tasks at the
processor, while the death rate is $1/m$ for all states. The parameter $s$ is used to connect
the queues together and is chosen to achieve equilibrium. The unknown $s$ is a root of a polynomial equation.
Once $s$ has been determined, the approximate equilibrium distribution and its mean can be found. The
derivation is given in the appendix along with simulation results which attest the high accuracy of the
approximation. If we let $F = n\rho/(\lambda(1-\rho))$, the no-information mean response time of Section 3.1,
then the mean response times are as follows:


POLICY                    MEAN RESPONSE TIME

no information            $F[1]$
local control, L=1        $F[1/(1+\rho)]$
local control, L=2        $F[1 - \rho + \rho/(1+\rho)^2]$
perfect information       $F[1 - \rho + C(n,n\rho)/n] \approx F[1-\rho]$
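
The local control entries can be reproduced numerically. The sketch below is one way to carry out a birth-death approximation of this kind; it assumes that $s$ is fixed by requiring the rate of forwarded arrivals, $\lambda s$, to equal the rate at which a processor rejects its own arrivals, so that $s = P(k \ge L)$. The appendix derivation is not reproduced here and may differ in detail; the function name and the bisection solver are choices made for this illustration.

```python
def local_control_response_time(limit, rho, m=1.0):
    """Approximate mean response time under a control limit `limit`.

    rho is the per-processor traffic intensity (arrival rate * m), 0 < rho < 1.
    """
    if not 0.0 < rho < 1.0 or limit < 1:
        raise ValueError("need 0 < rho < 1 and limit >= 1")

    def weights(s):
        # Unnormalized equilibrium weights, split at the control limit.
        a, r = rho * (1.0 + s), rho * s          # birth/death ratios below/above limit
        below = [a ** k for k in range(limit)]   # states 0 .. limit-1
        tail = a ** limit / (1.0 - r)            # total weight of states >= limit
        return a, r, below, tail

    def excess_rejection(s):
        _, _, below, tail = weights(s)
        return tail / (sum(below) + tail) - s    # P(state >= limit) - s

    lo, hi = 0.0, 1.0                            # excess is > 0 at 0 and < 0 at 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if excess_rejection(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)

    a, r, below, tail = weights(s)
    norm = sum(below) + tail
    mean_below = sum(k * w for k, w in enumerate(below))
    mean_tail = a ** limit * (limit / (1.0 - r) + r / (1.0 - r) ** 2)
    mean_tasks = (mean_below + mean_tail) / norm
    return mean_tasks * m / rho                  # Little's law with arrival rate rho/m
```

Under this assumption the sketch agrees with the closed forms $F/(1+\rho)$ for $L = 1$ and $F[1 - \rho + \rho/(1+\rho)^2]$ for $L = 2$, with $F = m/(1-\rho)$.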


The results are rather surprising. The fractional reduction in mean response time for perfect information
over no information is approximately $\rho$. The fractional reduction for the $L = 1$ policy is $\rho/(1+\rho)$. Of
the total reduction of $\rho$ for perfect information, a fraction $1/(1+\rho)$ can be obtained using only local
information. Even at high intensities, local information gives over 50% of the possible gain. If one allows
general $L$'s, then at least 62% is possible for any traffic intensity! It is worthwhile to point out that some
non-local information can be obtained at nearly no additional cost by keeping track of the source of each task
in the local queue. Hence one can further improve system performance by using this "free" non-local
information. For example, if a processor decides to ship out a task, it should be sent to a processor which has
not shipped it anything in its current queue. This all suggests that very effective control can be exercised at
low traffic intensities using just local control. At higher intensities, the mean response times become quite
long. Even though local control can capture a high percentage of the overall gain possible from perfect
information, it becomes important to refine the decision making with extra information. Specifically, one
needs to make a more informative choice of the processor selected to process the task. The purely random
choice described earlier is inadequate.
Several such schemes are suggested in the next section.
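
The 50% and 62% figures can be checked from the closed forms for $L = 1$ and $L = 2$. The sketch below uses only those two entries and takes the perfect-information reduction to be $\rho$, neglecting the Erlang C term; the function name and the grid of intensities are choices made for this illustration.

```python
def gain_fraction(limit, rho):
    """Share of the perfect-information reduction (taken as rho) captured locally."""
    if limit == 1:
        factor = 1.0 / (1.0 + rho)
    elif limit == 2:
        factor = 1.0 - rho + rho / (1.0 + rho) ** 2
    else:
        raise ValueError("closed forms are given only for L = 1 and L = 2")
    return (1.0 - factor) / rho


# For each intensity pick the better of L = 1 and L = 2, then take the worst case.
grid = [k / 1000.0 for k in range(1, 1000)]
worst_share = min(max(gain_fraction(1, rho), gain_fraction(2, rho)) for rho in grid)
```

The worst case falls where the two policies are equally good, near $\rho \approx 0.62$, and the share captured there is also about 0.62, consistent with the figure quoted above.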




3.3  Approaches to Decentralized Load Balancing

Once one has introduced a measure of the value of information, it remains to develop an algorithm which
will nearly achieve the theoretical upper bound on performance. In general, processors will send messages to
inform other processors of their status. The more frequent the messages, the more information other
processors will have. Unfortunately, these messages create overhead which serves to degrade system performance.
To deal with this tradeoff, one can draw an analogy with team decision theory. Information is valuable and
worth the cost only if it causes the recipient to change his decision and results in a lower overall cost. A
message should be sent only if it has a sufficiently large information content to justify the cost of sending it.
Here we are ignoring the reliability considerations which may dictate that a message be sent periodically. We
use the word 'information' in the Shannon sense. The information gained from a message is equal to the
uncertainty removed. Only messages describing relatively unusual or surprising system states should be
transmitted. In the load balancing context, the high information content messages are those which describe
overloading or underloading conditions. This suggests that each processor should transmit a status message
according to a control limit policy, when it is either underloaded or overloaded relative to its equilibrium
distribution. The exact control limits are determined to optimize the tradeoff between the cost of information
and the gain in performance from it.
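
The link between Shannon information and control limits can be made concrete. The sketch below assumes, purely for illustration, that a processor's queue length follows the geometric equilibrium distribution of an M/M/1 queue, and it places the limits where the tail event "queue length at or beyond this value" carries at least `min_bits` of information. The threshold `min_bits` and the function names are hypothetical, and the sketch does not perform the cost optimization described above.

```python
def control_limits(rho, min_bits):
    """Return (low, high): report when queue length <= low or >= high.

    A limit is set where the tail event carries at least `min_bits` of
    Shannon information under the geometric M/M/1 equilibrium distribution.
    low is -1 when no underload state is surprising enough.
    """
    if not 0.0 < rho < 1.0 or min_bits <= 0:
        raise ValueError("need 0 < rho < 1 and min_bits > 0")
    alpha = 2.0 ** (-min_bits)            # tail probability worth reporting
    # P(Q >= k) = rho**k, so the upper limit is the smallest k with rho**k <= alpha.
    high = 0
    while rho ** high > alpha:
        high += 1
    # P(Q <= k) = 1 - rho**(k + 1); take the largest k with that <= alpha.
    low = -1
    while 1.0 - rho ** (low + 2) <= alpha:
        low += 1
    return low, high


def should_send_status(queue_length, rho, min_bits):
    """Send a status message only for surprisingly low or high queue lengths."""
    low, high = control_limits(rho, min_bits)
    return queue_length <= low or queue_length >= high
```

Under these assumptions a heavily loaded system ($\rho = 0.9$) reports both idleness and very long queues, while a lightly loaded one ($\rho = 0.5$) reports only overload, since an empty queue is unsurprising there.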

The  control  limit  policy  described  in  the  previous  paragraph  should  provide  a  very  cfftcienl  decentralized 
control  algorithm.  The  policy  is,  however,  one  based  on  long  run  equilibrium  calculations  which  ignore  short 
run  fluctuations.  A  processor  will  send  a  status  message  when  its  workload  deviates  from  its  long  run 
c.vpcctaiions.  It  is  possible  diat  dicrc  could  be  a  heavy  influx  of  jobs  over  a  short  period  (or  a  very  small 
influx),  and  many  processors  will  have  above  (below)  average  workloads.  This  causes  many  status  messages 
to  be  sent  which  further  degrades  the  system.  This  effect  can  be  overcome  by  having  control  limits  which  are 
determined  dynamically  and  change  based  on  short  term  load  fluctuations.  Unfortunately,  this  would  seem 
to  necessitate  sending  even  more  messages  to  identify  these  short  run  fluctuations.  There  is,  however,  an 
alternative  approach.  This  approach  is  based  on  having  each  processor  constnict  a  system  status  model.  If  wc 
assume  that  the  processors  are  linked  by  a  bus  connection,  then  each  processor  must  do  intensive  bus 
monitoring.  Each  processor  must  keep  track  of  all  traffic  on  the  bus,  noting  its  source,  its  destination,  and  if 
possible  its  estimated  processing  time.  This  information  allows  each  processor  to  build  and  update  a  system 
status  model.  Such  a  model  would  predict  the  current  workload  at  each  processor.  Based  on  this  model,  the 
processor could act as a central scheduler to determine the best processor to handle any particular task. The
quality  of  these  decisions  will  be  totally  a  function  of  the  accuracy  of  the  model  used.  At  first  glance,  it  would 
seem  that  the  system  status  model  might  be  accurate  for  a  short  time  period,  but  its  quality  would  quickly 
deteriorate as time passes, due in part to the uncertainty of the exact processing requirements of any
particular task. It is necessary to introduce some sort of feedback control to keep the model accurate. This
can also be done in a highly decentralized, efficient manner. Under the assumption that the processors are
coupled by a bus, they will all see the same traffic. As a result they will all build identical system status
models. Thus the models used by each of the processors give essentially identical predictions of the
workload at any particular time. In addition, each of the processors has extra information in that it
knows the exact workload for itself. A processor can compare its own workload with the workload as
predicted by the common system status model. The two will of course deviate. It is only important to notify
the other processors when this deviation becomes significant. This is again done on a control limit basis.
When the actual load is significantly larger or smaller than the load predicted by the model, that processor
must send a message to all other processors to have that part of the model updated. The exact control limits
must be determined to trade off the increase in performance against the overhead costs of sending the message.
In this fashion an effective feedback control mechanism can be established in a highly decentralized fashion.
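
The feedback scheme described above can be illustrated with a minimal Python sketch. The class and
function names, the additive load model, and the numeric control limit are assumptions made purely for
illustration; the report does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """One node holding its own copy of the common system status model."""
    ident: int
    n_nodes: int
    limit: float                 # control limit (assumed units: time of queued work)
    actual_load: float = 0.0     # known exactly only to this node
    model: list = field(default_factory=list)  # predicted load of every node

    def __post_init__(self):
        self.model = [0.0] * self.n_nodes

    def observe_bus(self, dest: int, est_time: float) -> None:
        """Record a job seen on the bus by adding its estimated time to dest's predicted load."""
        self.model[dest] += est_time

    def needs_correction(self) -> bool:
        """True when the real load lies outside the control limits around the prediction."""
        return abs(self.actual_load - self.model[self.ident]) > self.limit


def broadcast_correction(sender: Processor, processors: list) -> None:
    """Overwrite the sender's entry in every copy of the model with its actual load."""
    for p in processors:
        p.model[sender.ident] = sender.actual_load


nodes = [Processor(ident=i, n_nodes=3, limit=1.5) for i in range(3)]
for p in nodes:                      # every node sees the same bus traffic
    p.observe_bus(dest=1, est_time=2.0)
nodes[1].actual_load = 5.0           # the job turned out longer than estimated
if nodes[1].needs_correction():      # deviation 3.0 exceeds the limit 1.5
    broadcast_correction(nodes[1], nodes)
```

Only the node whose own load has drifted outside its limits sends a message, so message traffic in this
sketch scales with model error rather than with the number of jobs.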

4  Future  Research 

The previous section indicates how one can quantify the value of system status information. We wish to
generalize these results to a broader context. This will be done as follows. We will assume that both jobs and
processors may be heterogeneous. Further, we will reintroduce the costs (in time) of communication
associated with sending jobs around the network. A number of models are possible to handle this greater
degree of generality. We might assume that processes arriving at processor i (at rate $\lambda_i$) can be
processed by any other processor, but the amounts of time required will vary. A number of factors will cause
these time differences: differing communication times, the capabilities of the particular processors, the
requirements of the job, the location of the data, etc. We assume that the time to process a job at processor j
is random and has a distribution with mean $m_{ij}$ and variance $s_{ij}^2$. From this structure, one must
evaluate the system performance under a full information and a no information structure.

Under the assumption of no dynamic information, the decisions must be based solely on the parameters
$\lambda_i$, $m_{ij}$ and $s_{ij}^2$. Again a probabilistic algorithm will be optimal. Performance quantities
such as the mean response time can be calculated from the Pollaczek-Khinchin formula. The full information
optimal performance evaluation can be carried out by a straightforward Lagrange multipliers argument.
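
As an illustration of the Pollaczek-Khinchin calculation mentioned above, the following Python sketch
computes the mean response time of an M/G/1 queue and of a probabilistically split Poisson stream. It
assumes a single arrival stream and per-processor service parameters; the function names and the example
numbers are hypothetical and not taken from the report.

```python
def pk_mean_response(lam, mean, var):
    """Mean response time of an M/G/1 queue (Pollaczek-Khinchin).

    lam: Poisson arrival rate; mean, var: mean and variance of the service time.
    """
    rho = lam * mean
    if rho >= 1.0:
        raise ValueError("queue is unstable: lam * mean must be below 1")
    second_moment = var + mean ** 2
    return mean + lam * second_moment / (2.0 * (1.0 - rho))


def routed_mean_response(lam, probs, means, variances):
    """Mean response time when a Poisson stream of rate lam is split at random.

    A job goes to processor j with probability probs[j]; splitting a Poisson
    stream this way gives each processor an independent Poisson input.
    """
    return sum(
        p * pk_mean_response(lam * p, m, v)
        for p, m, v in zip(probs, means, variances)
        if p > 0.0
    )


# Exponential service (variance = mean squared) reduces to mean / (1 - rho).
exp_case = pk_mean_response(0.5, 1.0, 1.0)
# Deterministic service (variance 0) halves the queueing delay.
det_case = pk_mean_response(0.5, 1.0, 0.0)
split_case = routed_mean_response(1.0, [0.5, 0.5], [1.0, 1.0], [1.0, 1.0])
```

A no-information policy could then be chosen by minimizing `routed_mean_response` over the routing
probabilities, subject to their summing to one.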

It is intuitively clear that heterogeneity in general reduces the value of system status information. This
observation  follows  because  knowledge  of  the  availability  of  a  particular  processor  will  be  useless  for  jobs 
which  are  mismatched  to  that  processor.  Heterogeneity  increases  the  number  of  such  mismatches  and  thus 
reduces  the  number  of  cases  where  information  is  of  value. 

In  summary,  there  remain  a  variety  of  important  issues  yet  to  be  investigated.  First,  there  is  a  need  to 
evaluate the response times of systems more general than the one analyzed in sections 1 and 2. This is
outlined above. Second, one must evaluate the performance of the various control strategies outlined in
section 3.3. These initial results do, however, indicate that the team decision approach with limited local
information will offer a very high performance decentralized system.




5  Appendix 

The behavior of the network under an L policy can be determined approximately using a simple birth and
death process model. We focus attention on a single node. This node receives external input according to a
Poisson ($\lambda$) process. It accepts and queues these tasks if there are currently fewer than L tasks there.
It sends the work elsewhere if it currently has L or more tasks. The node also accepts tasks sent by other nodes
that invoked the L policy. These tasks must be kept and processed. That is, each task can be sent only once by a
node. We assume the same external input rate $\lambda$ and exponential service rate $1/m$ at each node.

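The queue length process at each node is treated as a birth and death process with constant death rate $1/m$.
The birth rate is $b_i = \lambda + \lambda\theta = \lambda(1+\theta)$ if $0 \le i < L$ and is $\lambda\theta$
if $i \ge L$. Here $\theta$ is the fraction of traffic forwarded by a node and is
$\theta = \sum_{k \ge L} \pi_k$, where $\{\pi_k\}$ is the equilibrium distribution of the birth-death
process. One can find the equilibrium distribution from standard birth and death theory:

$$\pi_k = \pi_0 \prod_{i=0}^{k-1} (b_i m), \quad 1 \le k < \infty, \qquad \sum_{k=0}^{\infty} \pi_k = 1.$$

Consequently, writing $\rho = \lambda m$ for the traffic intensity,

$$\pi_k = \begin{cases} \pi_0\,[\rho(1+\theta)]^{k}, & 1 \le k \le L \\ \pi_0\,[\rho(1+\theta)]^{L}\,(\rho\theta)^{k-L}, & k > L. \end{cases}$$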

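5.1 Case 1: L = 1

We consider the case L = 1. Here, the equilibrium distribution is given by

$$\pi_k = \begin{cases} \rho(1+\theta)\,\pi_0, & k = 1 \\ \rho\theta\,\pi_{k-1}, & k > 1 \end{cases}$$

where $\theta = 1 - \pi_0$ and $1 = \sum_{k=0}^{\infty} \pi_k$.

The condition $1 = \sum_{k} \pi_k$ gives $\pi_0(1+\rho)/(1-\rho\theta) = 1$. The further requirement
$\theta = 1 - \pi_0$ gives $\pi_0 = 1 - \rho$, so $\theta = \rho$. This gives the equilibrium distribution
in terms of $\rho$ alone:

$$\pi_k = \begin{cases} 1 - \rho, & k = 0 \\ \rho^{2k-1}(1+\rho)(1-\rho), & k \ge 1. \end{cases}$$

The mean queue length can be calculated directly from this distribution to be $\rho/[(1-\rho)(1+\rho)]$. Little's
formula can be used to find the waiting time $= F\,(1/(1+\rho))$.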
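
The closed form for L = 1 is easy to evaluate numerically. The short Python sketch below (function names
are illustrative, not from the report) computes the distribution and the mean queue length for a given
traffic intensity, taking the stated formulas as given.

```python
def l1_equilibrium(rho, kmax):
    """pi_0 .. pi_kmax for the L = 1 policy, using the closed form in rho alone."""
    pi = [1.0 - rho]
    pi += [rho ** (2 * k - 1) * (1 + rho) * (1 - rho) for k in range(1, kmax + 1)]
    return pi


def l1_mean_queue_length(rho):
    """Mean queue length rho / ((1 - rho)(1 + rho)) for the L = 1 policy."""
    return rho / ((1 - rho) * (1 + rho))


pi = l1_equilibrium(0.1, 50)                       # traffic intensity 0.1
mean_from_sum = sum(k * p for k, p in enumerate(pi))
mean_closed = l1_mean_queue_length(0.1)
```

At traffic intensity 0.1 the truncated sum and the closed-form mean agree, and the first few
probabilities can be compared directly with the theory column of the simulation tables.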




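5.2 Case 2: L = 2

The analysis for L = 2 is similar to the L = 1 case. The equilibrium distribution is given by

$$\pi_k = \begin{cases} \rho(1+\theta)\,\pi_{k-1}, & k = 1, 2 \\ \rho\theta\,\pi_{k-1}, & k > 2 \end{cases}$$

where $\sum_{k=0}^{\infty} \pi_k = 1$ and $\theta = 1 - \pi_0 - \pi_1$.

The three conditions $\sum_{k} \pi_k = 1$, $\theta = 1 - \pi_0 - \pi_1$ and $\pi_1 = \rho(1+\theta)\pi_0$ can
be used to find the equilibrium distribution. It is $\pi_0 = 1 - \rho$ and

$$\pi_k = \begin{cases} \rho(1+\rho)(1-\rho)/(1+\rho(1-\rho)), & k = 1 \\ \left[\rho^{3}/(1+\rho(1-\rho))\right]^{k}(1+\rho)^{2}(1-\rho)/\rho^{4}, & k \ge 2. \end{cases}$$

Once again, the equilibrium queue length can be found from this distribution. The waiting time is given by
$F\,(1-\rho+\rho/(1+\rho)^{2})$.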
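
For other values of L the same birth-death equations can be solved numerically. The Python sketch below
swaps the closed-form algebra for a simple fixed-point iteration on the forwarded fraction; the function
name, the truncation point `kmax`, and the iteration count are assumptions made for illustration, and the
truncation is only adequate when the traffic intensity is not too close to 1.

```python
def l_policy_equilibrium(rho, L, kmax=200, iters=200):
    """Approximate equilibrium distribution of the L policy birth-death model.

    rho: traffic intensity (lambda * m); L: forwarding threshold.
    Returns (pi, theta), where pi[k] is the probability of k tasks at a node
    and theta is the fraction of external traffic a node forwards.
    """
    theta = 0.0
    pi = []
    for _ in range(iters):
        up = rho * (1.0 + theta)      # b_i * m below the threshold
        down = rho * theta            # b_i * m at or above the threshold
        weights = [1.0]
        for k in range(1, kmax + 1):
            weights.append(weights[-1] * (up if k <= L else down))
        total = sum(weights)
        pi = [w / total for w in weights]
        theta = sum(pi[L:])           # probability of holding L or more tasks
    return pi, theta


pi2, theta2 = l_policy_equilibrium(0.5, 2)
mean2 = sum(k * p for k, p in enumerate(pi2))
pi1, theta1 = l_policy_equilibrium(0.3, 1)
```

For traffic intensity 0.5 and L = 2 the iteration settles on a forwarded fraction of 0.2, matching the
closed form above; for L = 1 it reproduces the forwarded fraction equal to the traffic intensity.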


5.3  Simulation  Results 

It is important to determine the accuracy of the previous approximation. A simulation was carried out for
that purpose. Selected results are presented below for the case of 5 nodes. The results show that the
approximation is extremely accurate for traffic intensities up to 0.5 and still accurate when traffic intensity
is 0.7. At high traffic intensities, the network becomes difficult to simulate accurately as extremely long runs
are needed. The approximation will work better as the number of nodes increases, but simulations are difficult to
carry out in such cases. The following are simulation results and theoretical results for the probability of n
jobs in the queue. In addition, the mean queue lengths from simulation and from theory are compared.


Run 1: Traffic Intensity = 0.1, Policy: L = 1

                       Simulation    Theory
Prob[n = 0]               0.902      0.900
Prob[n = 1]               0.097      0.099
Prob[n = 2]               0.001      0.001
Prob[n >= 3]              0.000      0.000
Mean queue length         0.099      0.101

Run 2: Traffic Intensity = 0.1, Policy: L = 2

                       Simulation    Theory
Prob[n = 0]               0.901      0.900
Prob[n = 1]               0.090      0.091
Prob[n = 2]               0.009      0.009
Prob[n >= 3]              0.000      0.000
Mean queue length         0.108      0.109

Run 3: Traffic Intensity = 0.3, Policy: L = 1

                       Simulation    Theory
Prob[n = 0]               0.702      0.700
Prob[n = 1]               0.268      0.273
Prob[n = 2]               0.026      0.025
Prob[n = 3]               0.003      0.002
Prob[n >= 4]              0.001      0.000
Mean queue length         0.333      0.329

Run 4: Traffic Intensity = 0.3, Policy: L = 2

                       Simulation    Theory
Prob[n = 0]               0.699      0.700
Prob[n = 1]               0.225      0.226
Prob[n = 2]               0.074      0.073
Prob[n = 3]               0.002      0.001
Prob[n >= 4]              0.000      0.000
Mean queue length         0.379      0.375

Run 5: Traffic Intensity = 0.5, Policy: L = 1

                       Simulation    Theory
Prob[n = 0]               0.502      0.500
Prob[n = 1]               0.365      0.375
Prob[n = 2]               0.094      0.094
Prob[n = 3]               0.027      0.023
Prob[n = 4]               0.008      0.006
Prob[n = 5]               0.003      0.002
Prob[n >= 6]              0.001      0.000
Mean queue length         0.687      0.666

Run 6: Traffic Intensity = 0.5, Policy: L = 2

                       Simulation    Theory
Prob[n = 0]               0.502      0.500
Prob[n = 1]               0.292      0.300
Prob[n = 2]               0.179      0.180
Prob[n = 3]               0.022      0.018
Prob[n = 4]               0.004      0.002
Prob[n >= 5]              0.001      0.000
Mean queue length         0.737      0.722

Run 7: Traffic Intensity = 0.7, Policy: L = 1

                       Simulation    Theory
Prob[n = 0]               0.299      0.300
Prob[n = 1]               0.344      0.357
Prob[n = 2]               0.167      0.175
Prob[n = 3]               0.087      0.086
Prob[n = 4]               0.046      0.042
Prob[n = 5]               0.025      0.021
Prob[n = 6]               0.014      0.010
Prob[n = 7]               0.008      0.005
Prob[n >= 8]              0.010      0.004
Mean queue length         1.468      1.365

Run 8: Traffic Intensity = 0.7, Policy: L = 2

                       Simulation    Theory
Prob[n = 0]               0.301      0.300
Prob[n = 1]               0.279      0.295
Prob[n = 2]               0.273      0.290
Prob[n = 3]               0.086      0.082
Prob[n = 4]               0.034      0.023
Prob[n = 5]               0.015      0.007
Prob[n = 6]               0.007      0.002
Prob[n >= 7]              0.005      0.001
Mean queue length         1.371      1.267
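
A simulation of this kind can be sketched in a few lines of Python. The sketch below is not the simulator
used for the runs above; it assumes that a forwarded task goes to a uniformly chosen other node, that
transfers take no time, and that service times are exponential with mean m. The function name, the
default arguments, and the run length are likewise illustrative choices.

```python
import random
from collections import defaultdict


def simulate_l_policy(n_nodes, rho, L, n_events, seed=0, m=1.0):
    """Return (probabilities by queue length, mean queue length), averaged over nodes."""
    rng = random.Random(seed)
    lam = rho / m                      # external arrival rate per node
    queue = [0] * n_nodes
    time_at = defaultdict(float)       # queue length -> accumulated node-time
    for _ in range(n_events):
        busy = [i for i in range(n_nodes) if queue[i] > 0]
        arrival_rate = n_nodes * lam
        total_rate = arrival_rate + len(busy) / m
        dt = rng.expovariate(total_rate)
        for length in queue:           # time-weighted statistics
            time_at[length] += dt
        if rng.random() < arrival_rate / total_rate:
            i = rng.randrange(n_nodes)
            if queue[i] >= L:          # forward once, to a different node
                j = rng.randrange(n_nodes - 1)
                i = j if j < i else j + 1
            queue[i] += 1              # the receiving node must keep the task
        else:
            queue[rng.choice(busy)] -= 1
    total = sum(time_at.values())
    probs = {k: v / total for k, v in sorted(time_at.items())}
    mean = sum(k * p for k, p in probs.items())
    return probs, mean


probs, mean = simulate_l_policy(n_nodes=5, rho=0.5, L=1, n_events=200_000, seed=1)
```

Because every task is served exactly once and the nodes are symmetric, the estimated probability of an
empty queue should stay close to one minus the traffic intensity, which gives a quick sanity check on any
such run.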



A  Best-Effort  Scheduling  Model  for 
Real-Time  Operating  Systems 

E. Douglas Jensen, C. Douglass Locke, Hideyuki Tokuda

Computer  Science  Department 
Carnegie-Mellon  University,  Pittsburgh,  PA  15213 


This  work  was  supported  in  part  by  the  USAF  Rome  Air  Development  Center  under  contract  number 
F30602-81-C-0297, the U.S. Naval Ocean Systems Center under contract number N66001-81-C-0484,
the  U.S.  Navy  Office  of  Naval  Research  under  contract  number  N51160-84-C  0484,  and  the  IBM 
Corporation,  Federal  Systems  Division.  The  views  and  conclusions  contained  in  this  document  are 
those  of  the  authors  and  should  not  be  interpreted  as  representing  the  official  policies,  either 
expressed  or  implied,  of  RADC,  NOSC,  ONR,  IBM,  or  the  U.S.  Government. 




Abstract 

Process  scheduling  in  real-time  systems  has  almost  invariably  used  one  or  more  of  three  algorithms: 
fixed  priority,  FIFO,  or  round  robin.  The  reasons  for  these  choices  are  simplicity  and  speed  in  the 
operating system, but the cost to the system in terms of reliability and maintainability has not
generally been assessed.

This  paper  originates  from  the  notion  that  the  primary  distinguishing  characteristic  of  a  real-time 
system is the concept that completion of a process or a set of processes has a value to the system
which  can  be  expressed  as  a  function  of  time.  This  notion  is  described  in  terms  of  a  time-driven 
scheduling  model  for  real-time  operating  systems  and  provides  a  tool  for  measuring  the  effectiveness 
of  most  of  the  currently  used  process  schedulers  in  real-time  systems.  Applying  this  model,  we  have 
constructed  a  multiprocessor  real-time  system  simulator  with  which  we  measure  a  number  of  well- 
known  scheduling  algorithms  such  as  Shortest  Process  Time  (SPT),  Deadline,  Shortest  Slack  Time, 
FIFO,  and  a  fixed  priority  scheduler,  with  respect  to  the  resulting  total  system  values.  This  approach 
to  measuring  the  process  scheduling  effectiveness  is  a  first  step  in  our  longer  term  effort  to  produce  a 
scheduler  which  will  explicitly  schedule  real-time  processes  in  such  a  way  that  their  execution  times 
maximize  their  collective  value  to  the  system,  either  in  a  shared  memory  multiprocessing  environment 
or in multiple nodes of a distributed processing environment.

1. Introduction

Process  scheduling  in  a  computer  operating  system  is  merely  an  instance  of  an  extensively  studied 
problem  from  Operations  Research  (OR),  in  which  it  takes  the  form  of  producing  a  sequence  of  jobs 
which  must  utilize  a  common  resource  (e.g.,  a  lathe  in  a  machine  shop)  such  that  some  metric  (e.g., 
number  of  parts  made  in  a  day)  is  optimized  (maximized  or  minimized).  In  the  real-time  operating 
system  as  well,  the  choice  of  process  scheduling  algorithms  to  allocate  cpu  resources  can  have  a 
very  important  impact  on  the  system  performance  as  well  as  on  the  reliability  and  maintainability  of 
the  system. 

This paper is written in the context of the Archons project, currently under way at Carnegie-Mellon
University. Archons is undertaking to create new resource management paradigms which are as
intrinsically decentralized as possible, and then to apply them as the foundation of an experimental
decentralized real-time operating system which will in turn be used in the design and construction of a
"decentralized computer". The process scheduler for this system, which will be expected to manage
time resources to ensure that the most important processes meet their time constraints even in the
event of an overload condition, constitutes one of the critical research interests of the Archons project.

In the remainder of this section, we discuss real-time systems and the current state of real-time
process  scheduling,  while  in  Section  2  we  define  our  time-driven  scheduling  computational  model.  In 
Section  3  we  describe  our  experimental  simulation  results,  summarizing  our  results  in  Section  5. 

1.1.  The  Scheduling  Problem 

Scheduling  a  set  of  processes  consists  of  sequencing  them  on  one  or  more  processors  such  that  the 
utilization  of  resources  optimizes  some  scheduling  criterion.  Criteria  which  have  historically  been 
used  to  generate  process  schedules  include  maximizing  process  flow  (i.e.,  minimizing  the  elapsed 
time  for  the  entire  sequence),  or  minimizing  the  maximum  lateness  (lateness  is  defined  to  be  the 
difference between the time a process is completed and its deadline). It has long been known that
there are simple algorithms which will optimize certain such criteria under certain conditions, but
the problems of optimizing most of the interesting scheduling criteria are known to be NP-complete,
indicating that there is no known efficient algorithm which can produce an optimum sequence.
Clearly,  the  choice  of  metric  is  crucial  to  the  generation  of  a  processing  sequence  which  will  meet  the 
goals  of  the  system  for  which  the  schedule  is  being  prepared. 
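
One of the simple cases alluded to above can be shown concretely. On a single processor, with all jobs
available at time zero and no preemption, ordering jobs by earliest deadline minimizes the maximum
lateness (a classical result usually attributed to Jackson). The Python sketch below checks this by brute
force on a small job set; the job data and function names are illustrative assumptions.

```python
from itertools import permutations


def max_lateness(jobs):
    """Maximum lateness of jobs run back to back from t = 0.

    jobs: sequence of (processing_time, deadline) pairs.
    Lateness is completion time minus deadline and may be negative.
    """
    clock = 0
    worst = float("-inf")
    for processing_time, deadline in jobs:
        clock += processing_time
        worst = max(worst, clock - deadline)
    return worst


def earliest_deadline_order(jobs):
    """Sequence the jobs by increasing deadline."""
    return sorted(jobs, key=lambda job: job[1])


jobs = [(3, 9), (2, 4), (4, 6)]
edd = earliest_deadline_order(jobs)
best = min(max_lateness(order) for order in permutations(jobs))
```

The brute-force search over all orderings is only feasible for a handful of jobs, which is exactly why
criteria without such a simple ordering rule are hard to optimize in practice.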

There are a number of significant differences between scheduling as it is practiced in most
applications of OR and process scheduling in an operating system. In OR, scheduling is traditionally
performed  statically,  off-line,  while  in  an  operating  system,  the  information  from  which  scheduling 
decisions  must  be  made  is  dynamic,  suggesting  that  the  best  scheduling  decisions  should  be  made 
dynamically.  In  addition,  classic  OR  scheduling  problems  assume  (for  the  sake  of  computational 
simplicity, not particularly for realism) that all the jobs to be scheduled, and their processing time
requirements, are available at the beginning of the sequence (t = 0), and that new jobs will not
suddenly  appear  during  processing.  If  a  change  in  any  data  involved  in  the  scheduling  decision  (such 
as  the  arrival  of  a  new  job)  should  occur,  the  previously  computed  sequence  is  invalidated,  and 
scheduling  must  be  started  over  if  optimality  is  to  be  maintained.  On  the  other  hand,  in  the  real-time 
operating  system  environment,  processes  frequently  arrive  and  depart  (i.e.,  are  completed  or 
terminated)  at  irregular  intervals  and  have  stochastic  processing  times. 

Real-time  processes  can  be  partitioned  into  two  categories:  periodic  and  aperiodic.  Periodic 
processes  arrive  at  regular  intervals,  while  aperiodic  processes  arrive  irregularly,  but  even  periodic 
processes  can  be  terminated  or  undergo  changes  in  period  due  to  application  dependencies, 
resulting in a constantly changing set of schedulable processes.




1.2. Real-Time Systems

Although  many  real-time  systems  have  been  constructed,  the  fundamental  differences  between  a 
real-time  system  and  other  computer  systems  have  not  always  been  clearly  understood.  It  is  our 
thesis  that  the  most  important  difference  between  the  real-time  operating  system  and  other  computer 
systems  is  that  in  a  real-time  system,  the  completion  of  a  process  has  a  value  to  the  system  which 
varies with time. Although the time to complete a process is of some importance in all computer
systems,  in  a  real-time  system  the  response  time  is  viewed  as  a  crucial  part  of  the  correctness  of  the 
application  software;  a  computation  which  is  late  is  very  frequently  no  better,  or  perhaps  even  worse, 
than  one  producing  an  incorrect  result. 

This  time  value  is  usually  described  in  terms  of  deadlines  by  which  computations  must  be  completed, 
and  the  deadlines  are  generally  traceable  to  the  physical  environment  with  which  the  system  must 
interact:  for  example,  a  reaction  temperature  measurement  must  be  completed  in  time  to  apply  a 
correction  in  a  manufacturing  system.  Stated  another  way,  the  value  to  the  system  for  completing  a 
process  with  a  deadline  is  some  high  constant  value  which  abruptly  drops  to  zero  after  the  deadline. 
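
The deadline-as-step-function view in the preceding paragraph can be written down directly. The Python
sketch below defines such a value function and sums it over a schedule; the function names, the
back-to-back single-processor schedule, and the numeric values are assumptions chosen only to illustrate
the idea.

```python
def step_value(completion_time, deadline, value):
    """Value of completing a process: constant up to the deadline, zero afterwards."""
    return value if completion_time <= deadline else 0.0


def total_system_value(schedule):
    """Sum of values earned when processes run back to back from t = 0.

    schedule: list of (run_time, deadline, value) tuples in execution order.
    """
    clock = 0.0
    total = 0.0
    for run_time, deadline, value in schedule:
        clock += run_time
        total += step_value(clock, deadline, value)
    return total


schedule = [(2.0, 3.0, 10.0), (2.0, 3.0, 5.0)]
value_in_order = total_system_value(schedule)
value_reversed = total_system_value(list(reversed(schedule)))
```

Here only one of the two processes can finish by its deadline, so the order of execution decides how much
value the system collects.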

Real-time systems have been divided into two classes: hard real-time systems and soft real-time
systems. Hard real-time systems are those whose deadlines must absolutely be met or the system
will  be  considered  to  have  failed  (and  whose  failure  might  be  catastrophic),  while  soft  real-time 
systems  allow  for  some  deadlines,  at  least  occasionally,  to  be  missed  with  only  a  degradation  in 
performance  but  not  a  complete  failure. 

Many  existing  real-time  systems  are  being  used  in  a  number  of  such  diverse  environments  as  space  or 
airborne  platform  management,  factory  process  control,  and  robotics.  These  involve  several  levels  of 
real-time,  characterized  by  the  tightness  of  their  deadlines,  and  range  from  closed-loop 
sensor/actuator  systems  (frequently  implemented  on  microprocessors  dedicated  to  a  small  number 
of  devices)  to  supervisory  systems  involving  human  interaction  to  manage  a  complex  command  and 
control  environment. 

1.3. Scheduling in Existing Real-Time Systems

In observing a number of existing real-time systems, we note the characteristics found in the
operating systems used to manage resources in support of their real-time operation
(such operating systems are usually called executives, since a full operating system is not usually
implemented in a real-time system). Two primary characteristics of these operating systems can be
described:

•  These operating systems are kept simple, with minimal overhead, but also with minimal
function. Virtual storage is almost never provided; file systems are usually either
extremely limited or non-existent. I/O support is kept to an absolute minimum.
Scheduling is almost always provided by some combination of FIFO (for message
handling), fixed priority ordering, or round-robin, with the choice made by the operating
system designer.

•  Simple  support  for  management  of  a  hardware  real-time  clock  is  provided,  with  facilities 
for  periodic  process  scheduling  based  on  the  clock,  and  timed  delay  primitives. 

Conspicuously  missing  from  these  systems  at  the  operating  system  level  is  any  support  for  explicitly 
managing  user-defined  deadlines,  even  though  meeting  such  deadlines  is  the  primary  characteristic 
of  the  real-time  application  requirements.  Instead,  these  systems  are  designed  to  meet  their 
deadlines  by  attempting  to  ensure  that  the  available  resources  exceed  the  expected  worst-case  user 
requirements,  and  their  implementation  is  followed  by  an  extensive  testing  period  in  an  attempt  to 
verify  that  these  requirements  can  be  met  under  these  expected  loads. 

In  fixed  priority  scheduler  systems,  deadline  management  is  attempted  by  assigning  a  high  fixed 
priority  to  processes  with  "important"  deadlines,  disregarding  the  resulting  impact  to  less 
"important"  deadlines.  During  the  testing  period,  these  priorities  are  (usually  manually)  adjusted  until 
the  system  implementer  is  convinced  that  the  system  "works".  This  approach  can  work  only  for 
relatively  simple  systems,  since  the  fixed  priorities  do  not  reflect  any  time-varying  value  of  the 
computations  with  respect  to  the  problem  being  solved,  nor  do  they  reflect  the  fact  that  there  are 
many  schedulable  sets  of  process  deadlines  which  cannot  be  met  with  fixed  priorities.  In  addition, 
implementers  of  such  systems  find  that  it  is  extremely  difficult  to  determine  reasonable  priorities, 
since,  typically,  each  individual  subsystem  implementer  feels  that  his  or  her  module  is  of  high 
importance  to  the  system.  This  problem  is  usually  "solved"  by  deferring  final  priority  determination  to 
the  system  test  phase  of  implementation,  so  the  resulting  performance  problems  remain  hidden  until  it 
is  too  late  to  consider  the  most  effective  design  solutions.  This  approach  results  not  only  in  real-time 
systems  with  frequently  marginal  performance,  but  also  in  extremely  fragile  systems  in  the  presence 
of  changing  requirements. 

A  scheduling  algorithm  which  is  seldom  used  in  practice  has  some  interesting  properties,  but  also 
presents  a  serious  pitfall.  A  deadline  scheduler  (i.e.,  a  scheduler  which  schedules  the  process  with 
the closest deadline first) solves the problem of missing otherwise schedulable deadlines due to the
imposition  of  fixed  priorities,  but  leaves  other  problems,  most  notably  transient  overloads.  Transient 
overloads  occur  not  only  as  a  result  of  application  and  operating  system  design  decisions,  but  also  as 
a  consequence  of  normal  system  activity,  including  interrupt  processing,  DMA  cycle  stealing,  or 
blocking on semaphores. The deadline scheduler provides no reasonable control over the choice
of which deadlines must be delayed in an overload, leading to unpredictable failures and resulting in
an adverse impact on reliability and maintainability.

The value of a real process completion is well modeled by a step function only in those few cases in
which there is no longer any value in completing the process after its deadline, such as
computing  a  control  parameter  for  a  manufacturing  step  which  has  already  been  completed.  In  many 
actual  systems,  the  value  of  completing  a  computation  after  its  deadline  may  rapidly  decay  but  remain 
positive  (e.g.,  being  late  on  an  aircraft  navigation  update  may  result  in  a  loss  of  positional  accuracy, 
while  missing  it  altogether  would  merely  exacerbate  the  loss  unless  it  became  so  late  that  it  impacted 
making  the  next  update),  or  completing  a  computation  before  its  deadline  may  be  more  or  less 
desirable  than  completing  it  exactly  at  its  deadline  (e.g.,  a  satellite  orbital  insertion  burn  computation, 
which  must  occur  within  a  small  time  "window"  around  an  optimal  time).  As  we  show  experimentally, 
process  scheduling  in  the  presence  of  a  (possibly  transient)  overload  shows  particularly  well  the 
difficulty in producing a reasonable schedule with respect to different processes' value functions.
Determining an appropriate value for a given process consists not simply of assigning a priority,
but rather of defining the system value of completing it at any time. This value is normally defined by
the  physical  application  environment  which  the  system  must  serve. 

1.4. Evaluating a Real-Time Scheduler

Given  that  the  primary  characteristic  of  a  real-time  system  is  that  correctness  is  determined  not  only 
by  what  is  done,  but  when  it  is  done,  we  propose  to  use  a  representation  of  a  process  completion 
value  to  measure  the  algorithms  commonly  used  or  considered  for  use  in  a  real-time  system.  In 
particular,  we  will  simulate  a  real-time  system  in  which  each  process  has  a  value  expressed  as  a 
function of time which defines the value to the system of completing that process at any time, and we
will "reward" the system with the value determined by that function when the process terminates. A
set of processes which arrive at stochastically determined intervals (regular intervals for periodic
processes and Poisson arrivals for aperiodic processes) and compete for a processing resource in a
symmetric multiprocessor will then be executed for a given period of time. The sum of the resulting
process values for all processes requested during the execution period will then provide our metric
for  determining  the  performance  of  each  scheduling  algorithm  under  several  conditions  of  loads,  and 
with  several  different  value  functions.  This  metric  measures  the  long-term  (relative  to  the  individual 
process  computation  times)  performance  of  the  system  with  respect  to  its  support  of  the  application- 
defined  value  of  meeting  time  constraints,  a  direct  measure  of  the  application  real-time  support 
provided. 
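
The metric described above can be sketched in a few lines. This is a minimal illustration, not the simulator's code: the names `total_value` and `step`, and the sample completion times, are hypothetical and chosen only to show how the per-process values are summed.

```python
from typing import Callable, Iterable, Tuple

ValueFn = Callable[[float], float]  # maps a completion time to a value


def total_value(completions: Iterable[Tuple[ValueFn, float]]) -> float:
    """Sum V_i(t_done) over all completed processes."""
    return sum(value_fn(t_done) for value_fn, t_done in completions)


def step(critical_time: float, height: float = 1.0) -> ValueFn:
    """Hard-deadline shape: full value up to the critical time, none after."""
    return lambda t: height if t <= critical_time else 0.0


# Hypothetical run: (value function, completion time) for three processes.
runs = [(step(2.0), 1.5), (step(2.0), 2.5), (step(4.0, 3.0), 3.9)]
print(total_value(runs))  # the second process finished late and adds nothing
```

Comparing this sum across schedulers, for the same set of requests, is what allows the algorithms to be ranked under a given load and value function.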




2.  Time-driven  Scheduling  Model 

We define a computational model consisting of a set of preemptible processes P resident in a
computer with a single shared memory and one or more processing elements. Each process P_i has a
request time R_i, an estimated computation interval C_i, and a value function V_i(t), where t is a time for
which the value is to be determined. Figure 2-1 illustrates these process attributes for a hypothetical
process whose value function is linearly decreasing prior to a critical time D_i, with an exponential value
decay following D_i. The process illustrated has been dispatched shortly after
its request time and has completed prior to its critical time without being preempted.


Figure 2-1: Process Model Attributes for Process i


V_i(t) defines the value to the system for completing P_i at time t. The nature of V_i is determined by the
range of scheduling policies supported by the operating system, particularly with respect to the
handling  of  a  processor  overload  in  which  some  critical  times  cannot  be  met.  The  value  function 
could  come  from  either  or  both  of  two  sources:  the  process  implementer  who  understands  the 
requirements  of  the  internal  environment  as  it  affects  that  process,  and  the  system  architect  who  is 
aware  of  the  relationship  between  that  process  and  the  overall  system,  including  its  external 
environment.  For  the  purposes  of  this  paper,  we  assume  that  all  sources  of  value  information  are 
reflected  in  the  value  functions  used. 

We note that the existence and importance of a deadline for P_i depend on the nature of its value
function. The value function can be said to define a critical time (also called a deadline if the function
is a step function) only if it has a discontinuity in the function or its first or second derivative.
Functions of this type are characterized by a relatively rapid change in value with respect to time, with
the deadline case only an extreme example. For example, the process illustrated in Figure 2-1 has a
critical time defined by the discontinuity in its first derivative at D_i. For convenience in the discussion
of  the  simulator,  we  will  actually  refer  to  a  critical  time  as  the  time  axis  origin  relative  to  which  the 
value  function  for  a  process  will  be  computed,  even  if  the  function  has  no  discontinuity  as  described 
above. 




The  use  of  value  functions  as  described  in  this  model  allows  us  to  describe  both  hard  and  soft 
real-time  environments,  and,  in  particular,  allows  us  to  evaluate  systems  which  mix  deadlines  of  both 
types  in  a  single  system.  Since  the  value  functions  used  in  this  model  may  or  may  not  explicitly  define 
either  a  critical  time  or  a  deadline,  and  the  critical  time  may  not  actually  constitute  a  deadline,  in  the 
remainder  of  this  paper  we  will  avoid  the  use  of  the  word  "deadline",  which  implies  the  step  value 
function.  Hence,  we  will  refer  to  the  time  of  the  value  function  discontinuity,  if  any,  as  its  critical  time. 

The request time R_i has been defined in our simulation (see Section 3) as an arbitrary time at which P_i
has been requested to be executed. The significance to the scheduler as we have defined it is that
the process is not schedulable prior to R_i. Following R_i, it remains schedulable until one of the
following conditions has occurred:

•  It  has  completed, 

•  Its  value  function  has  become  zero  or  negative. 

We note that the value function may be negative at R_i, not rising above zero until a later time (see
Section  3  for  an  example  of  such  a  function). 
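
The schedulability conditions above can be written as a small predicate. This is a sketch under a stated assumption: since a value function may start out negative and rise later, the "zero or negative" test is applied here only after the critical time. That reading, and the names `Process` and `is_schedulable`, are illustrative and not taken from the simulator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Process:
    request_time: float               # R_i
    critical_time: float              # D_i
    value: Callable[[float], float]   # V_i(t), t in absolute time
    completed: bool = False


def is_schedulable(p: Process, now: float) -> bool:
    """True if p may be dispatched at time `now`."""
    if p.completed or now < p.request_time:
        return False
    # Assumption: a non-positive value ends schedulability only once the
    # critical time has passed, so a function that is still rising toward
    # positive values before D_i keeps its process in the request set.
    if now > p.critical_time and p.value(now) <= 0.0:
        return False
    return True


hard = Process(1.0, 3.0, lambda t: 1.0 if t <= 3.0 else 0.0)
rising = Process(1.0, 5.0, lambda t: t - 2.0)  # negative at its request time
```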

Thus, as viewed by the process scheduler, the request time R_i for a process may be either a future or
a past time. If R_i is a future time, the process is not currently schedulable, but its attributes may be
considered in the current computations of load from which current scheduling decisions are made.

The computation time C_i is a random variable representing the expected execution time to process P_i
(estimated time remaining if P_i has already begun processing), not including system overhead or
preemptions. The source of this value and its distribution for a real operating system would be either
the programmer or an actual measurement by the system itself (see Section 4), but here we will simply
assume  that  it  is  available  to  the  scheduler.  Clearly,  the  value  function  can  be  seen  as  a  definition  of 
application  scheduling  policy,  and  although  in  a  real  system  not  every  policy  would  be  subsumed  in  a 
single  set  of  value  functions,  in  this  paper,  we  will  assume  that  the  value  function  completely  defines 
the  policies  to  be  implemented  by  the  scheduler.  The  relationship  between  our  value  functions  and 
the  global  scheduling  policy  is  a  subject  of  continuing  research  in  this  effort. 

Given  the  potentially  diverse  set  of  value  functions  which  could  be  produced  in  a  given  system,  we  will 
show  that  none  of  the  normally  used  scheduling  algorithms  can  consistently  produce  a  good 
schedule as the processing resources approach saturation (see Section 3). As an example,
Figure 2-2 shows four processes in a single processor with value functions for which the
best choice of an execution sequence is non-trivial. The figure shows the four value functions, and a
potential scheduling sequence such that each completes with a high value. We will describe later the
significance of such value functions in the discussion of the experiments conducted.


Figure  2-2:  Four  "Typical"  Processes  with  Value  Functions 

This  model  encompasses  both  periodic  and  aperiodic  processes  in  that  any  single  execution  of  a 
periodic  process  can  be  described  as  shown  above  in  exactly  the  same  way  as  for  a  non-periodic 
process.  The  only  difference  is  that  the  future  request  times  for  a  periodic  process  are  known  in 
advance.  We  allow  overlapping  executions  by  allowing  multiple  instances  of  a  process  to  be 
simultaneously  schedulable  (we  assume  that  processes  are  reentrant). 

It  should  be  noted  that  precedence  and  consistency  (e.g.,  processes  involved  in  mutual  exclusion) 
relations  are  not  addressed  in  this  model  at  this  time,  since  it  has  been  assumed  that  the  scheduling 
algorithm  will  consider  only  processes  which  are  ready  to  be  executed  and  for  which  a  request  time 
has  been  determined.  If  one  process  is  dependent  upon  completion  of  another,  the  second  process' 
request  time  will  not  have  been  determined  until  the  first  process  has  been  completed. 

3.  Simulation  Results 

3.1. Real-Time Scheduling Simulator

To  provide  an  environment  in  which  various  scheduling  algorithms  can  be  evaluated  with  respect  to 
their  performance  in  generating  a  high  total  system  value,  a  simulator  has  been  constructed  which 
provides  an  operating  system  environment  in  which  each  of  the  scheduling  algorithms  can  be 
executed  and  evaluated. 


The value functions to be used by the simulator are not completely general forms, but rather are
limited to a specific form whose characteristics make the expression of value very
flexible while allowing the analysis of the resulting functions to be as tractable as possible. We define
the value function in two parts: the value prior to the critical time and the value after the critical time,
providing  for  a  discontinuity  at  the  critical  time  if  desired  by  the  application  designer.  For  each  of 
these  two  parts,  five  constants  are  used  to  define  the  value  function  relative  to  the  critical  time  using 
the  expression: 

    V_i(t) = K_1 + K_2 t - K_3 t^2 + K_4 e^{-K_5 t}
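
A two-part, five-constant value function can be sketched as follows. The printed expression is damaged in this copy, so the form used here (constant, linear, quadratic, and exponential terms, K1 + K2 t - K3 t^2 + K4 e^(-K5 t)) is an assumption, chosen to match the shapes the text names (constant, linear, parabolic, and exponential decay). The function name and the sample constants are hypothetical.

```python
import math
from typing import Callable, Sequence


def make_value_function(before: Sequence[float],
                        after: Sequence[float]) -> Callable[[float], float]:
    """Return V(t), with t measured relative to the critical time (t = 0).

    `before` and `after` each hold five constants (K1..K5), one set for
    each side of the critical time.
    """
    def branch(k: Sequence[float], t: float) -> float:
        k1, k2, k3, k4, k5 = k
        # Assumed form: K1 + K2*t - K3*t^2 + K4*exp(-K5*t)
        return k1 + k2 * t - k3 * t * t + k4 * math.exp(-k5 * t)

    def value(t: float) -> float:
        return branch(before, t) if t <= 0.0 else branch(after, t)

    return value


# Constant value of 10 up to the critical time, exponential decay afterwards.
v = make_value_function((10, 0, 0, 0, 0), (0, 0, 0, 10, 2.0))
print(v(-1.0), v(0.0), v(1.0))
```

Using separate constant sets on each side is what permits a discontinuity in the function, or in its derivative, at the critical time.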

The  simulator  first  uses  a  statistical  model  of  a  "typical"  real-time  processor  load  to  generate  a  set  of 
processes,  including  both  periodic  and  aperiodic  processes.  The  statistical  model  defines  a  real-time 
workload  based  on  experience  with  several  actual  real-time  systems  with  which  this  author  is  familiar. 
The  processor  load  generated  can  be  varied  to  simulate  both  heavily  loaded  systems  and  relatively 
lightly  loaded  systems.  The  nature  of  this  statistical  load,  including  the  process  characteristics,  the 
simulated  hardware  characteristics,  and  the  value  functions,  can  be  easily  modified  during  the 
experimentation  allowing  us  to  evaluate  a  number  of  systems  with  different  characteristics.  It  is  also 
possible  to  impose  an  additional  "spike"  load  in  which  the  aperiodic  arrival  rate  dramatically 
increases  at  certain  periods  during  the  simulation,  allowing  us  to  determine  scheduler  performance 
during  transient  overloads.  Our  test  process  sequences  will  emphasize  the  overloaded  condition, 
since  that  is  the  condition  of  primary  interest. 

Once  a  set  of  processes  has  been  constructed,  the  simulator  "executes"  the  processes,  applying  a 
statistical  model  to  generate  the  actual  request  times  and  computation  times  for  each  process. 
Execution  consists  of  iteratively  calling  the  selected  scheduler  for  its  decisions,  "executing"  the 
selected  process,  updating  the  clock  to  the  next  important  time  (e.g.,  time  of  arrival  of  a  new  process), 
then  asking  the  scheduler  for  its  next  decision. 
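
The execution loop described above can be sketched for a single preemptible processor. This is a simplified illustration rather than the simulator itself: it ignores overhead, uses fixed computation times instead of a statistical model, and the names `Job` and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    arrival: float     # request time
    remaining: float   # computation time still needed
    critical: float    # critical time (absolute)


def simulate(jobs: List[Job],
             choose: Callable[[List[Job], float], Job]) -> Dict[str, float]:
    """Return {job name: completion time} on one preemptible processor."""
    pending = sorted(jobs, key=lambda j: j.arrival)
    ready: List[Job] = []
    done: Dict[str, float] = {}
    clock = 0.0
    while pending or ready:
        # Admit every job whose request time has been reached.
        while pending and pending[0].arrival <= clock:
            ready.append(pending.pop(0))
        if not ready:
            clock = pending[0].arrival  # idle until the next arrival
            continue
        job = choose(ready, clock)      # ask the scheduler for its decision
        # Next important time: this job finishes, or a new job arrives.
        finish = clock + job.remaining
        next_time = min(finish, pending[0].arrival) if pending else finish
        job.remaining -= next_time - clock
        clock = next_time
        if job.remaining <= 1e-12:
            ready.remove(job)
            done[job.name] = clock
    return done


def earliest_critical_time(ready: List[Job], now: float) -> Job:
    return min(ready, key=lambda j: j.critical)


result = simulate([Job("A", 0.0, 3.0, 10.0), Job("B", 1.0, 1.0, 3.0)],
                  earliest_critical_time)
print(result)  # B preempts A at t = 1, so B finishes first
```

Because the scheduler is re-consulted at every arrival, a newly arrived process with a closer critical time preempts the running one.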

3.1.1. A Set of Classical Algorithms

The  simulator  has  the  ability  to  make  multiple  runs  sequentially  using  an  identical  set  of  processes 
with  identical  values  and  request  times,  using  various  "classical"  algorithms.  These  "classical" 
algorithms  include: 

1.  SPT.  At  each  decision  point,  the  process  with  the  shortest  estimated  completion  time  is 
executed. 

2.  Deadline.  At  each  decision  point,  the  process  with  the  earliest  critical  time  (interpreted 
by  this  algorithm  as  a  deadline)  is  executed. 

3.  Slack.  At  each  decision  point,  the  process  with  the  smallest  estimated  slack  time 
(elapsed  time  to  the  critical  time  minus  its  estimated  completion  time)  is  executed. 




4.  FIFO.  At  each  decision  point,  the  process  which  has  been  in  the  request  set  longest  is 
executed. 

5.  Random.  At  each  decision  point,  a  process  is  chosen  (with  uniform  probability)  from  the 
request  set  and  executed. 

6.  RandPRTY.  At  each  decision  point,  the  process  with  the  highest  fixed  priority  is 
executed.  In  the  runs  reported,  this  fixed  priority  was  chosen  to  be  the  same  as  the 
highest  value  reached  by  the  corresponding  value  function. 
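
Each of the six rules above reduces to a one-line selection over the request set. The sketch below is illustrative only: the record type `Ready`, its field names, and the sample data are assumptions, and the fixed priority is supplied directly rather than derived from a value function.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Ready:
    name: str
    requested: float   # time the process entered the request set
    remaining: float   # estimated completion (computation) time
    critical: float    # critical time, treated as a deadline
    priority: float    # fixed priority; higher runs first


def spt(rs: List[Ready], now: float) -> Ready:
    return min(rs, key=lambda p: p.remaining)


def deadline(rs: List[Ready], now: float) -> Ready:
    return min(rs, key=lambda p: p.critical)


def slack(rs: List[Ready], now: float) -> Ready:
    # Slack = time left until the critical time minus estimated completion time.
    return min(rs, key=lambda p: (p.critical - now) - p.remaining)


def fifo(rs: List[Ready], now: float) -> Ready:
    return min(rs, key=lambda p: p.requested)


def fixed_priority(rs: List[Ready], now: float) -> Ready:
    return max(rs, key=lambda p: p.priority)


def random_choice(rs: List[Ready], now: float, rng=random) -> Ready:
    return rng.choice(rs)  # uniform over the request set


request_set = [Ready("a", 0.0, 5.5, 8.0, 1.0),
               Ready("b", 1.0, 1.0, 12.0, 3.0),
               Ready("c", 2.0, 4.0, 7.0, 2.0)]
for rule in (spt, deadline, slack, fifo, fixed_priority):
    print(rule.__name__, rule(request_set, 2.0).name)
```

On this sample set the rules disagree: Deadline picks the closest critical time, while Slack picks the process that can least afford to wait.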

The  simulator  keeps  a  large  number  of  statistics  on  each  run,  including  such  parameters  as  the 
number  of  tardy  processes,  the  maximum  lateness,  the  average  lateness,  and  the  total  value 
accumulated.  This  data  is  saved  and  reported  for  each  real-time  run  made  in  a  simulation  sequence, 
and  is  reported  in  a  statistical  summary  at  the  end  of  the  simulation,  resulting  in  the  data  provided 
below. 

3.1.2.  A  Pair  of  Interesting  New  Algorithms 

In addition to these algorithms, two experimental algorithms have been implemented, and have
produced  some  initially  promising  results  which  we  include  in  the  experiments  demonstrated  here. 
An  important  part  of  our  research  effort  consists  of  the  further  development  of  algorithms  such  as 
these,  and  the  concepts  behind  them,  as  they  apply  to  a  real-time  system. 


In  these  initial  algorithms,  we  take  advantage  of  three  observed  value  function  and  scheduling 
characteristics: 

1.  Given  a  set  of  processes  (ignoring  deadlines)  with  known  values  for  completing  them,  it 
can be shown that a schedule in which the process with the highest value density (V/C, in
which  V  is  its  value  and  C  is  its  processing  time)  is  processed  first  (i.e.,  a  Value  Density 
Schedule)  will  produce  a  total  value  at  every  point  in  time  at  least  as  high  as  any  other 
schedule. 

2.  Given  a  set  of  processes  with  deadlines  which  can  all  be  met  (based  on  the  sequence  of 
the  deadlines  and  the  computation  times  of  the  processes),  it  can  be  shown  that  a 
schedule  in  which  the  process  with  the  earliest  deadline  is  scheduled  first  (i.e.,  a  Deadline 
schedule)  will  always  result  in  meeting  all  deadlines. 

3.  Most  value  functions  of  interest  (at  least  among  those  investigated  at  this  time)  have  their 
highest  value  occurring  immediately  prior  to  the  critical  time. 
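
The two orderings named in observations 1 and 2 can be sketched as follows, together with a simple check of whether a sequence meets every deadline. The sketch assumes all processes are available at time zero and run back to back on one processor; the tuple layout and the sample values are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Task = Tuple[float, float, float]  # (value V, computation time C, deadline)


def value_density_order(tasks: Sequence[Task]) -> List[Task]:
    """Highest V/C first."""
    return sorted(tasks, key=lambda t: t[0] / t[1], reverse=True)


def deadline_order(tasks: Sequence[Task]) -> List[Task]:
    """Earliest deadline first."""
    return sorted(tasks, key=lambda t: t[2])


def meets_all_deadlines(sequence: Sequence[Task], start: float = 0.0) -> bool:
    clock = start
    for _, c, d in sequence:
        clock += c
        if clock > d:
            return False
    return True


tasks = [(8.0, 2.0, 7.0), (3.0, 1.0, 1.0), (4.0, 4.0, 9.0)]
print(meets_all_deadlines(deadline_order(tasks)))       # all deadlines met
print(meets_all_deadlines(value_density_order(tasks)))  # one deadline missed
```

The sample set is schedulable, so the deadline ordering meets every deadline, while the value density ordering runs the densest process first and misses the tightest one.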


Some  of  the  implications  of  these  observations  are: 

•  If  no  overload  occurs,  the  deadline  schedule  will  have  all  deadlines  met,  and  no  higher 
total  value  will  be  possible. 

• If an overload occurs, and some processes must miss their deadlines, the Value Density
Schedule would produce a high value, but might miss some deadlines which could
otherwise have been met.

Our first algorithm (called the BEValue1 algorithm in the experimentation description, Section 3.2)
exclusively  uses  observation  1  above,  and  is  therefore  a  simple  greedy  algorithm  scheduling  first  the 
process  with  the  highest  expected  value  density.  This  algorithm  actually  performs  reasonably  well  in 
many  cases  in  which  the  value  function  is  a  step  function  or  rapidly  decreasing  following  the  critical 
time,  in  spite  of  the  fact  that  it  makes  no  explicit  use  of  the  critical  time  itself.  The  critical  time  does,  of 
course,  enter  the  algorithm  through  the  expected  value  computation,  which  uses  the  value  function 
and  the  assumed  computation  time  distribution  to  compute  the  expected  value.  This  algorithm  fails 
most  notably  in  step  function  situations  in  which  no  overload  is  present  and  a  number  of  processes 
with  close  deadlines  are  in  the  request  set.  In  this  case,  the  processes  with  high  value  density  will  be 
run,  quite  possibly  preventing  other  processes  from  meeting  their  deadlines  even  though  all  deadlines 
could  have  been  met. 
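
The greedy rule can be sketched as below. Several details are assumptions rather than the report's method: computation times are taken to be normally distributed, the expectation is approximated from equally probable quantiles, and expected value density is taken as expected value divided by mean computation time. All names are hypothetical.

```python
from statistics import NormalDist
from typing import Callable, List, Tuple

# (name, value as a function of absolute completion time, mean C, std dev of C)
Proc = Tuple[str, Callable[[float], float], float, float]


def expected_value(value: Callable[[float], float], mean_c: float,
                   sd_c: float, now: float, samples: int = 99) -> float:
    """Approximate E[V(now + C)] for C ~ Normal(mean_c, sd_c), sd_c > 0,
    by averaging V over equally probable quantiles of C."""
    dist = NormalDist(mean_c, sd_c)
    quantiles = [dist.inv_cdf((k + 0.5) / samples) for k in range(samples)]
    return sum(value(now + max(c, 0.0)) for c in quantiles) / samples


def greedy_pick(ready: List[Proc], now: float) -> str:
    """Name of the process with the highest expected value density."""
    def density(p: Proc) -> float:
        _, value, mean_c, sd_c = p
        return expected_value(value, mean_c, sd_c, now) / mean_c
    return max(ready, key=density)[0]


def step(critical: float, height: float) -> Callable[[float], float]:
    return lambda t: height if t <= critical else 0.0


ready = [("x", step(2.0, 10.0), 1.0, 0.1),
         ("y", step(0.5, 10.0), 1.0, 0.1),   # cannot finish by its deadline
         ("z", step(5.0, 4.0), 0.2, 0.05)]
print(greedy_pick(ready, 0.0))
```

The critical time never appears in the selection rule itself; it influences the choice only through the value function inside the expectation, as the text notes.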

Our  second  algorithm  is  a  modification  of  the  simple  deadline  algorithm,  attempting  to  remove  its 
most  important  failing:  in  an  overload  the  deadline  scheduler  will  give  priority  to  processes  whose 
deadlines  cannot  possibly  be  met,  delaying  other  processes  which  could  still  meet  their  deadlines. 
Therefore,  after  computing  the  probability  of  an  overload,  we  choose  processes  with  low  value 
density  as  candidates  for  being  removed  from  an  overloaded  deadline  schedule  until  a  deadline 
schedule  is  produced  which  has  an  acceptably  low  probability  of  producing  an  overload. 

This  algorithm  (called  the  BEValue2  algorithm  in  the  experimentation  description,  section  3.2),  starts 
with  a  deadline-ordered  sequence  of  the  available  processes,  which  is  then  sequentially  checked  for 
its  probability  of  overload.  At  any  point  in  the  sequence  in  which  the  overload  probability  passes  a 
preset  threshold,  the  process  prior  to  the  overload  with  the  lowest  value  density  will  be  removed  from 
the  sequence,  repeating  until  the  overload  probability  is  acceptable.  This  algorithm  seems  generally 
to outperform BEValue1, since it always meets deadlines as long as no overload occurs, and
transitions gradually into the same performance as BEValue1 as an overload condition worsens. In
this algorithm, modifications to the overload probability threshold can significantly affect its overload
performance; at the extremes, a threshold of 100% results in a pure deadline schedule, while a
threshold of 0% results in a BEValue1 schedule. For the experiments described here, a threshold of
40% is used. Optimizing this threshold, as well as the other critical values used in the heuristics from
which BEValue2 is constructed, is a goal of our continuing research in this area.




3.2.  Experiments  Executed 

To test our hypothesis that the efficacy of a scheduling algorithm in a real-time system can be
measured  with  respect  to  the  key  distinguishing  characteristic  of  such  systems,  namely  that 
completing  a  process  has  a  time-varying  value  to  the  system,  we  have  generated  a  set  of  simulation 
experiments  using  several  well-known  scheduling  algorithms.  As  with  any  such  set  of  measurements, 
this  set  is  necessarily  incomplete,  but  is  intended  to  illustrate  some  of  the  characteristics  which  would 
be  observable  with  systems  under  several  potential  conditions. 

Each  experiment  described  here  was  run  on  a  simulated,  symmetric,  shared-memory  multiprocessor, 
with  the  number  of  processing  elements  varying  from  one  to  four  (thus  testing  the  schedulers  under 
four  load  levels).  All  processes  were  considered  to  be  fully  preemptible  and  the  effects  of  context 
switching  and  scheduling  overhead  were  neglected.  Thirty-six  processes  were  selected,  each  with  an 
individually  determined  stochastic  load  assumed  to  be  normally  distributed  and  characterized  by  its 
mean  and  standard  deviation.  The  mean  loads  of  these  processes  were  themselves  approximately 
normally  distributed  across  the  36  processes,  with  a  mean  of  500  ms.  and  a  standard  deviation  of  300 
ms.,  with  the  limitation  that  the  minimum  process  load  was  1  ms.  Figure  3-1  lists  the  actual  set  of 
processes,  showing  for  each  its  process  ID,  its  period  (if  it  is  periodic),  its  load  characteristics,  and  its 
critical  time. 

As  each  simulation  progressed,  each  process  from  this  list  was  requested  at  either  the  periodic  rate,  if 
it  was  a  periodic  process,  or  using  an  exponential  interarrival  time  with  a  mean  of  10  seconds.  In 
addition to these requests, additional aperiodic processes arrived as a "spike" load with a mean
interarrival  time  of  1  second,  exponentially  distributed,  every  10  seconds,  with  the  "spike"  lasting  200 
ms.  This  simulated  set  of  requests  continued  for  60  seconds  of  elapsed  time,  at  the  end  of  which  the 
requests  ended  and  the  unfinished  processes  in  the  request  set,  if  any,  were  completed  by  the 
multiprocessor,  terminating  the  simulation  when  the  queue  was  empty.  At  the  completion  of  each 
process, its value was computed and the values from all completed processes were summed,
producing a final total value over the 60 second request interval. No value was accrued for processes
not completed (i.e., aborted). As the list in Figure 3-1 shows, four of the processes were periodic; the
total  system  load  from  these  four  processes  represented  about  42%  of  the  total  process  load.  This 
simulation  resulted  in  the  arrival  of  about  305  processes,  including  both  periodic  and  aperiodic 
processes. 
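
The workload parameters quoted above can be turned into a small generator. This is a sketch under assumptions: mean loads below the 1 ms minimum are clamped (the text does not say whether they were clamped or redrawn), periodic processes are first requested at time zero, and the "spike" load is omitted. Function names are hypothetical; times are in seconds.

```python
import random
from typing import List


def mean_loads(rng: random.Random, n: int = 36, mean: float = 0.5,
               sd: float = 0.3, floor: float = 0.001) -> List[float]:
    """Draw n mean process loads from Normal(mean, sd), clamped at `floor`."""
    return [max(floor, rng.gauss(mean, sd)) for _ in range(n)]


def aperiodic_arrivals(rng: random.Random, mean_interarrival: float = 10.0,
                       horizon: float = 60.0) -> List[float]:
    """Request times of one aperiodic process: exponential gaps up to horizon."""
    times: List[float] = []
    t = rng.expovariate(1.0 / mean_interarrival)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(1.0 / mean_interarrival)
    return times


def periodic_arrivals(period: float, horizon: float = 60.0) -> List[float]:
    """Request times of one periodic process, starting at time zero."""
    count = int(horizon / period) + 1
    return [k * period for k in range(count) if k * period < horizon]


rng = random.Random(1)  # fixed seed so the same request set can be replayed
loads = mean_loads(rng)
arrivals = aperiodic_arrivals(rng)
print(len(loads), len(periodic_arrivals(11.119)))
```

Seeding the generator mirrors the simulator's ability to replay an identical set of requests under each scheduling algorithm.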

Four  value  functions,  illustrated  in  Figure  3-2,  were  used  in  separate  executions  to  compute  the  total 
value  generated  by  each  of  the  scheduling  algorithms  under  differing  conditions  of  perceived  system 
value. 




ID   Period (s)   Mean Load (s)   Load σ (s)   Critical Time (s)

 1       -           0.474          0.058          0.683
 2       -           0.045          0.009          0.128
 3       -           0.580          0.159          1.746
 4       -           0.516          0.141          1.267
 5       -           0.492          0.132          1.069
 6       -           0.133          0.034          0.193
 7       -           0.627          0.129          1.402
 8       -           0.251          0.044          0.444
 9       -           0.859          0.271          1.894
10       -           0.520          0.181          0.762
11       -           0.574          0.080          1.405
12       -           0.217          0.027          0.674
13       -           0.920          0.166          2.752
14       -           0.737          0.135          1.749
15     1.731         0.247          0.069          0.618
16       -           0.515          0.067          1.159
17       -           0.748          0.345          2.160
18       -           0.696          0.248          1.390
19     5.762         0.782          0.229          1.280
20       -           0.555          0.088          1.422
21       -           0.805          0.197          2.490
22       -           0.051          0.003          0.148
23       -           0.684          0.166          1.583
24    11.119         0.827          0.215          1.932
25       -           0.408          0.133          1.254
26       -           0.001          0.000          0.002
27       -           0.894          0.185          2.126
28       -           0.754          0.166          2.119
29       -           0.387          0.099          1.166
30       -           0.658          0.141          1.606
31       -           0.249          0.103          0.562
32       -           0.459          0.112          1.409
33       -           0.650          0.100          1.788
34       -           0.498          0.152          0.714
35     3.663         0.245          0.045          0.618
36       -           0.130          0.041          0.322

Figure 3-1: Simulation Experiment Process List


The  first  (V1),  showing  a  constant  value  prior  to  the  critical  time  followed  by  an  exponential  decay 
after  the  deadline,  represents  a  case  such  as  a  vehicle  navigation  problem  in  which  a  position  update 
must  be  completed  by  a  given  time  each  cycle  to  provide  a  specified  precision,  but  if  late,  the 
resulting  error  can  be  minimized  by  making  up  the  computation,  assuming  the  next  navigation  cycle  is 
not  overrun.  As  the  next  cycle  approaches,  the  value  of  running  the  previous  one  becomes  negligibly 
small.  The  second  (V2)  represents  a  hard  deadline  process,  in  which  there  is  value  to  the  system  for 
completing  it  only  if  it  precedes  a  fixed  deadline,  such  as  for  certain  sensor  inputs,  in  which  the  value 
to be read is no longer present if a fixed time interval is exceeded. The third value function (V3) is
constant prior to the deadline, but drops off parabolically after the deadline, representing a process
similar to the hard deadline above, but with a softer deadline as long as the process is not excessively
late (in this experiment, the value function drops past zero at about 516 ms. following the critical
time). The fourth function (V4) is parabolic both before and after the critical time, representing a
process whose completion should be delayed until after some arbitrary time, such as a satellite
launch, in which the desired orbit cannot be reached prior to a given time or after a later time (in this
experiment, these functions pass zero approximately 516 ms. before and after the critical time).

Figure 3-2: Example Value Functions

In  an  actual  system,  it  would  be  expected  that  each  process  in  a  given  application  would  have  a 
distinct  value  function,  but  we  have  used  identically  shaped  value  functions  for  every  process  in  the 
run  to  isolate  the  effects  of  multiple  value  function  shapes  on  scheduling  quality.  For  each  of  the  four 
value  functions  chosen,  four  experiments  were  conducted,  with  1,  2,  3,  and  4  processing  elements 
each,  representing  average  process  loads  of  246%,  123%,  82%,  and  61%,  respectively,  and  each 
simulation  was  repeated  15  times,  averaging  the  computed  parameters.  For  each  algorithm  in  each 
experiment,  several  parameters  were  computed: 

• Total Value (V1, V2, V3, V4) -- The sum of each of the four value functions for each
process computed at its completion time, averaged over the 15 runs.

• Mean Preemptions (MP) -- The number of times a running process is preempted by a more
critical process.

• Mean Lateness (ML)* -- The difference between the actual process completion time and
its critical time, averaged over the 15 runs. This value is negative if the process
completes prior to its critical time, and positive after the critical time.

• Maximum Lateness (MaxL) -- The largest lateness value of the entire 60 second run,
averaged over the 15 runs for which each simulation was repeated.

• Mean Tardiness (MT) -- The average of the non-negative latenesses over the 15 runs.

• Number of Tardy Processes (NTP) -- The total number of processes completing after their
critical times, averaged over the 15 runs.
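
The lateness-based measures in this list can be computed for a single run as sketched below; averaging over the 15 repetitions is left out. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def lateness_stats(runs: List[Tuple[float, float]]) -> Dict[str, float]:
    """`runs` holds (completion time, critical time) pairs for one run."""
    lateness = [done - critical for done, critical in runs]
    non_negative = [x for x in lateness if x >= 0.0]
    return {
        "ML": sum(lateness) / len(lateness),          # mean lateness
        "MaxL": max(lateness),                        # maximum lateness
        # Mean tardiness: average of the non-negative latenesses only.
        "MT": sum(non_negative) / len(non_negative) if non_negative else 0.0,
        # Tardy processes complete strictly after their critical times.
        "NTP": sum(1 for x in lateness if x > 0.0),
    }


sample = [(1.0, 2.0), (3.5, 3.0), (6.0, 4.0)]  # one early, two late
print(lateness_stats(sample))
```

Mean lateness lets early completions offset late ones, whereas mean tardiness counts only the late side, which is why the two are reported separately.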

The results of these measurements for the standard algorithms are graphically represented in Figure
3-3, while those for the more specialized BEValue1 and BEValue2 algorithms are represented in Figure 3-4. In
these figures, each circle represents a particular scheduling algorithm executing under a particular
load. Within each circle, each line represents one of the measurements described above, with a
longer line indicating that the corresponding measurement was one of the best for that load value,
while a short line indicates that the corresponding measurement was relatively poor. Although these
measures are all reported for each set of runs, the most important measures from our perspective are
the four value function results, V1, V2, V3, and V4.

Processes  whose  value  has  passed  below  zero  are  aborted,  and  the  resulting  system  value  is  zero, 
reflecting  the  fact  that  they  produce  no  contribution  to  the  system  performance.  This  abortion  occurs 
for  all  algorithms  tested. 

3.3.  Experimental  Results  from  Standard  Scheduling  Algorithms 

There  are  a  few  observations  which  can  be  made  from  these  measurements  across  all  runs.  It  should 
be  noted  that  in  no  case  did  the  FIFO  scheduler  or  the  random  scheduler  generate  the  best  schedule, 
although  they  produced  acceptable  schedules  when  the  system  was  sufficiently  underloaded;  it  is 
clear  that  neither  of  these  would  be  the  algorithm  of  choice  unless  we  were  guaranteed  that  an 
overload  condition  could  not  occur.  Both  of  these  algorithms  are  used  extensively  in  certain  types  of 
actual  systems;  FIFO  is  used  for  many  message  passing  schedulers,  and  random  scheduling  is 
effectively  used  for  some  contention  bus  communications  links  (e.g.,  Ethernet).  Similarly,  the  fixed 
priority  scheduler  performed  inconsistently,  particularly  as  the  overload  condition  became  more 
pronounced. Thus, of these algorithms, the choice would be between the SPT and Deadline
scheduling algorithms, neither of which is widely used in existing real-time systems.


*Value functions resulting in aborting processes (i.e., all the value functions used in the experiments reported here except
the exponential decay function (V1)) produce meaningless values for the lateness and tardiness, so the reported
measurements for these are included only for runs which result in no abortions.




With respect to the observed value variations among the different processor loads, the first
observation is that, except for the fourth value function (V4) runs (parabolic value functions both prior
to and following the critical time), the scheduling algorithm makes little or no difference when the
system is lightly loaded. Although in the underloaded case the deadline scheduler generally shows
the best performance in most of the relevant measured parameters, the difference between the best
and worst among the six algorithms with respect to total value is within about 5% on value functions
V1, V2, and V3. As expected, we note that the insensitivity to the algorithm is most pronounced at the
average load of 61%, and less so at 82%, where the preference for the deadline and slack time
schedulers is more pronounced and the difference between the best and worst total values is
about 12%. This result confirms our intuition that the scheduling algorithm used for a suitably
underloaded system is not usually an extremely critical factor. The case with value function V4 is
radically different in that the value functions indicated that the processes should have had their
executions delayed to achieve acceptable values, but none of these algorithms took this effect into
account. Such an effect is typically overcome in existing real-time systems by the application
processes themselves introducing delays in their processing, and relying on priority to force the
delayed process to execute properly in its value function "window".

The picture becomes more complex as the load increases to an overload condition. At 123%
overload, the deadline scheduler is beginning to have considerable difficulty with V1, V2, and V3, and
the SPT scheduler seems to be performing somewhat better. Even the fixed priority scheduler is
doing better than the deadline scheduler. This effect, while perhaps not intuitive, can be explained by
noting that the SPT algorithm is designed to complete the largest possible number of processes in a
given time interval, increasing the likelihood that the short processes will complete prior to their
deadlines and resulting in a greater number of processes doing so, therefore producing a higher total
value. The deadline scheduler, while ensuring that deadlines will be met if there is sufficient
processing capacity, fails to satisfactorily handle an overload due to its propensity for giving priority
to processes which are about to miss their deadlines over those whose deadlines are still to come.
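
The overload effect described above can be reproduced on a toy task set. The sketch below is an assumption-laden simplification, not the simulator used in the experiments: all processes are ready at time zero, run to completion without preemption or abortion on one processor, and each contributes value only if it finishes by its deadline. The task set is hypothetical and was chosen so that the total computation time (11) exceeds the latest deadline (8).

```python
def met_deadlines(tasks, key):
    """Run tasks back to back in the order given by `key`; count deadlines met."""
    clock, met = 0, 0
    for comp_time, deadline in sorted(tasks, key=key):
        clock += comp_time
        if clock <= deadline:
            met += 1
    return met


# Hypothetical overloaded task set: (computation time, deadline).
tasks = [(5, 5), (2, 6), (2, 7), (2, 8)]

edf_met = met_deadlines(tasks, key=lambda t: t[1])  # earliest deadline first
spt_met = met_deadlines(tasks, key=lambda t: t[0])  # shortest processing time first
```

Under these assumptions the deadline ordering spends its first five time units on the most urgent process and then misses the remaining three deadlines, whereas the SPT ordering completes the three short processes on time and sacrifices only the long one.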

3.4.  Experimental  Results  from  Value  Function  Algorithms 

Of the two experimental value function algorithms described above, the BEValue2 scheduler is the
clear winner, as shown in Figure 3-4, which is plotted on the same scales as Figure 3-3, using
the identical value functions and loads. It outperforms all other tested algorithms in all the overload
runs by significant margins, and shows a consistency of performance which none of the other
algorithms can approach. Like the others tested, it had some difficulties with the V4 function, since it,
too, takes little account of the fact that the processes needed to be delayed to achieve a high value,
and this is one of the critical areas in which further research will be performed. These results should
be viewed as preliminary at best, but it is certainly clear that BEValue2 has some important
characteristics which would be needed in a value function scheduler.

Figure 3-4: Value Function Experiment Results

The BEValue1 scheduler performance was, like many of the other algorithms, much more variable,
depending  on  the  value  function  and  load.  Because  it  is  a  "greedy"  algorithm,  preferring  to  pick  up 
value  early  rather  than  wait  to  get  a  higher  value,  it  misses  many  opportunities  to  meet  time 
constraints. 

3.5.  Experimentation  Summary 

It is clear that the shape of the value function and the processing load level have a great effect on
which algorithm is better for total system performance when the system is overloaded. This
experimentation seems to indicate that, with improvements, a scheduler which overtly uses the
application-defined system value to handle scheduling should, at least in principle, provide a
consistently better schedule. In fact, such improved performance, particularly in overload conditions,
does result from our two experimental scheduling algorithms, although our analysis of them is by no
means complete at this time.




4.  Real-Time  Operating  Systems  Interface 

There  are  many  potential  approaches  to  utilizing  a  time-driven  scheduling  model  in  a  real-time 
operating  system.  Although  traditional  real-time  operating  systems  provide  a  simple  set  of  time 
control  primitives  (e.g.,  GetTime,  Schedule,  Timeout),  no  system  primitives  are  available  to  explicitly 
express request times R_i, critical times D_i, estimated computation times C_i, or a value function V_i(t) for each
process.  Furthermore,  a  real-time  operating  system  should  be  able  to  support  a  primitive  to  modify 
the  value  function  for  a  process  or  a  set  of  processes  during  run  time,  as  well  as  primitives  to  specify  a 
system-wide  scheduling  policy.  In  this  way,  an  application  designer  can  set  up  and  modify  a  suitable 
scheduling  policy  under  various  system  conditions. 

For  the  purpose  of  describing  these  operating  system  primitives,  we  assume  that  primitives  to  create 
and  kill  processes  already  exist.  We  express  each  primitive  as  if  it  were  implemented  as  a  procedure 
in  a  high-order  language  such  as  C  or  Pascal. 

4.1.  Time Control Primitives

The  arguments  to  these  operating  system  primitives  communicate  the  information  needed  to 
implement  this  model,  but  an  important  issue  is  the  structure  of  the  information  to  be  passed  to  the 
operating  system.  In  the  following  primitives,  the  structure  of  the  passed  information  has  been 
designed  to  encourage  internal  consistency.  Using  a  single  primitive  to  define  each  parameter  would 
be extremely flexible, but a user might inadvertently set mutually inconsistent parameters. Thus, the
system  should  provide  a  simple  way  to  define  a  consistent  set  of  parameters. 

The  Delay  primitive 

DateTime = Delay(DelayTime, CriticalTime, CompTime, SetFlag)

blocks the requesting process for a specified length of time (e.g., in microseconds), then returns the
current date and time when the delay is completed and execution has resumed. This primitive would
also be used to specify the critical time D_i (see Section 2) and an estimate of the amount of
processing time that will be required to reach the critical time,* C_i, as well as to mark the arrival of the
process at the critical event for which the critical time was established (while optionally defining the
next critical time) by setting the "SetFlag" value. If the value of "DelayTime" is not positive, the system
would not block the process; similarly, the estimated execution time and critical time would not be
changed if their values were not positive.


*A  system  may  also  estimate  this  processing  time  based  on  observations  of  earlier  executions. 




For  instance,  a  simple  command  to  alter  the  critical  time  can  be  defined  by  using  the  Delay  primitive: 
Delay(0,  CriticalTime,  0,  TRUE) 
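
The argument conventions of the Delay primitive can be sketched as follows. This is only an illustrative model under stated assumptions: time is a simulated number rather than a real clock, the process record and its field names are hypothetical, and the reading of "SetFlag" as simply marking arrival at the critical event is one interpretation of the description above, not a specification of the primitive.

```python
from dataclasses import dataclass


@dataclass
class ProcessTiming:
    """Hypothetical per-process timing record kept by the system."""
    critical_time: float = 0.0   # critical time of the process
    comp_time: float = 0.0       # estimated computation time
    events_reached: int = 0      # critical events marked so far


def delay(proc, now, delay_time, critical_time, comp_time, set_flag):
    """Model of Delay(): update timing parameters, return the resume time."""
    if set_flag:
        proc.events_reached += 1          # arrival at the critical event
    if critical_time > 0:                 # non-positive: leave unchanged
        proc.critical_time = critical_time
    if comp_time > 0:                     # non-positive: leave unchanged
        proc.comp_time = comp_time
    # A non-positive delay does not block the caller.
    return now + delay_time if delay_time > 0 else now


# Equivalent of Delay(0, CriticalTime, 0, TRUE): alter only the critical time.
p = ProcessTiming(critical_time=100.0, comp_time=10.0)
resumed_at = delay(p, now=50.0, delay_time=0, critical_time=250.0,
                   comp_time=0, set_flag=True)
```

Bundling the three parameters into one call is what lets the system check them together, in line with the consistency argument made for this interface.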

4.2.  Periodic  Processes 

While  the  Delay  primitive  provides  time  dependent  processing  control  at  runtime,  periodic  repetitive 
processing  can  be  specified  at  process  creation  time. 

One  way  to  describe  a  periodic  process  is  by  using  optional  arguments  in  a  CreateProcess  primitive: 

pid = CreateProcess(ProcessName, InitialMsg, PERIODIC, DelayTime, Period)

The  "CreateProcess"  primitive  would  create  a  new  instance  of  a  process  at  a  specified  node. 
"InitialMsg"  indicates  an  initial  message  to  be  immediately  sent  to  the  new  process  and  "PERIODIC" 
specifies that the new process will be periodic. "DelayTime" sets the first request time for the new
process  and  "Period"  is  the  indicated  period. 
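
To make the roles of "DelayTime" and "Period" concrete, the short sketch below enumerates the request times they imply. It assumes that times are measured relative to process creation and that the process is requested exactly once per period; the function name is hypothetical.

```python
from itertools import count, islice


def request_times(delay_time, period):
    """Yield successive request times of a periodic process."""
    for k in count():
        # First request after delay_time, then one request every period.
        yield delay_time + k * period


first_four = list(islice(request_times(delay_time=5, period=20), 4))
```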

4.3.  Scheduling  Policies 

For  a  practical  real-time  operating  system,  it  is  necessary  to  provide  a  mechanism  to  express  the 
application-defined  scheduling  policies  needed  to  implement  our  scheduling  model.  The  system 
should  be  able  to  modify  these  policies  at  runtime  in  order  to  take  advantage  of  the  flexibility  of  the 
model.  The  value  function  of  the  time-driven  model  could  be  provided  as  an  executable  function  by 
the user, or the user could express various scheduling policies by defining a dynamic set of values K_1,
K_2, ..., K_10 (see Section 3), so that a client need only pass these values to define a new value function
V_i(t):

SetValueFunction( K_1, K_2, ..., K_10 ), or
SetValueFunction( V_i() )

The first primitive is acceptable as long as the application does not require a new value function which
cannot be expressed by the fixed value function. The second primitive has greater flexibility in terms
of passing an arbitrary value function from a client to the system, but, since the actual code of the
value function would then be provided by the client, the scheduler would be required to compute its
value in the user's address space every time it is needed. Thus, this scheme may introduce a significant
overhead  to  the  scheduler  and,  in  addition,  such  flexibility  would  make  it  very  difficult  to  develop  a 
future  scheduler  able  to  use  a  value  function  explicitly  to  make  scheduling  decisions. 

Our approach is to use the notion of "policy/mechanism" separation [11] to increase the flexibility and
reduce scheduling overhead. The basic idea is to create a user-definable system module, called a
policy definition module, which consists of a policy body and a set of policy attributes. A policy
definition can be placed in a kernel, a system process or a client process as needed to control the
system overhead. A policy definition descriptor would indicate the location of the policy definition
and the information related to the policy body and attribute set.

For instance, a policy definition module could be defined and the policy attribute for a specific policy
could be set using the following primitives:

SetPolicy(PolicyName,  PolicyDefinitionDescriptor) 

SetAttribute(PolicyName, AttributeName, AttributeValue)

The SetPolicy primitive would bind a user-defined policy definition module to the operating system.
"PolicyName" indicates the symbolic name of the policy definition module and "PolicyDefinitionDescriptor"
points to a data structure which describes the policy definition module's location and its properties.
The SetAttribute primitive sets the value of a specified attribute for the specific policy module.

For example, the following calls could be used to replace the SetValueFunction primitive above.

SetPolicy( SCHEDULE, MyPDD )
SetAttribute( SCHEDULE, "VALUEFUNCTION", K_1, K_2, ..., K_10 )

This  SetPolicy  primitive  would  bind  a  scheduling  policy  module  described  by  "MyPDD"  to  the 
operating  system.  In  the  case  of  the  scheduling  policy  module,  we  would  place  the  proposed  value 
function at the kernel level. The SetAttribute primitive then defines the shape of the value function by
giving the values K_1, K_2, ..., K_10. We believe that such a proposed value function is rich enough to
express a powerful set of scheduling policies and we have reduced the scheduling overhead while
increasing scheduling flexibility by not attempting to pass the arbitrary function V_i(t) itself.

5.  Summary 

Process scheduling is a frequently overlooked determinant of real-time performance. It is our
contention that the time value of process completion, even though it is the primary characteristic of a
real-time system, has been neglected in measuring real-time performance. We have illustrated some of
the effects of varying the scheduling algorithms, particularly in the presence of an overload condition,
relative to several value function characteristics.


This use of value functions to describe the performance of existing scheduling algorithms represents
a first step in our goal of eventually producing a scheduler along the lines of BEValue2 which will
explicitly use such value functions to make scheduling decisions. Such a scheduler would attempt
to enhance the overall system value by ensuring that processes providing the highest value to the
system are favored in the event that some processes must be delayed or aborted due to a (possibly
transient) overload. This would provide a level of predictability to overload processing in a real-time
system which is not generally available at present, and should therefore allow more effective use of
the available resources, lowering system cost.

This paper has presented a preliminary overview of a significant research effort recently initiated
within the Archons project. The goals of this work include further understanding the scheduling
problem in the presence of user-defined value functions, determining the heuristics necessary to
create a tractable value function scheduler, understanding the sensitivity of such a scheduler to the
parameters defined by its heuristics, and analyzing the decisions made by those heuristics.
Additional measurements with modifications to the scheduling algorithms could be made, providing
for such capabilities as defined lower bounds for use in process abortion when the value function
drops below zero, as well as studies of the effects of varying the heights and shapes of the value
functions among the processes. The methods for determining which value function to use for
specific types of devices and algorithms, and the implications of value function scheduling for the
specification of scheduling policy, need to be studied; methods of decomposing value functions for a
number of processes cooperating to perform a single application function need to be determined; and
the use of value functions in the presence of process-to-process dependencies and mutual exclusion
must be investigated. Following the initial effort, this research will be extended by the Archons
project for process scheduling on a distributed system.

Acknowledgements 

Professor  Jensen  created  the  initial  concepts  and  development  of  continuous  time-valued  scheduling 
for  real-time  systems.  Mr.  Locke,  in  his  Ph.D.  thesis  research,  is  further  developing  the  model, 
defining  the  required  heuristics,  and  evaluating  them  through  simulation.  Dr.  Tokuda  is  responsible 
for  the  incorporation  of  these  ideas  into  a  scheduling  facility  for  the  Archons  project’s  ArchOS 
operating  system. 

The  authors  are  greatly  indebted  to  the  many  members  of  the  Archons  project,  especially  Professor 
John Lehoczky and Dr. Lui Sha, who provided considerable assistance in the definition of these
concepts, as well as to the Archons sponsors.




References 

1.  Jensen, E. D., Distributed Control, Springer-Verlag, 1981, pp. 175-190.

2.  Jensen,  E.  D.,  "Decentralized  Executive  Control  of  Computers",  Proceedings  of  the  Third 
International  Conference  on  Distributed  Computing  Systems,  IEEE,  October  1982,  pp.  31-35. 

3.  Jensen, E. D., "ArchOS: A Physically Dispersed Operating System", IEEE Distributed
Processing  Technical  Committee  Newsletter,  June  1984. 

4.  Locke, C. D., "Best-Effort Decision Making for Real-Time Scheduling", Ph.D. Thesis Proposal,
Carnegie-Mellon University, Computer Science Department.

5.  Conway, Maxwell, and Miller, Theory of Scheduling, Addison-Wesley, 1967.

6.  Garey,  M.  R.;  Johnson,  D.  S.,  Computers  and  Intractability:  A  Guide  to  the  Theory  of 
NP-Completeness,  W.  H.  Freeman,  San  Francisco,  1979. 

7.  Jensen,  E.  D.,  Private  Communication  based  on  his  Honeywell  Systems  and  Research  Center 
technical  report  on  this  topic  for  the  U.S.  Army  Ballistic  Missile  Defense  Advanced  Technology 
Center. 

8.  Mok,  A.  K.,  Fundamental  Design  Problems  of  Distributed  Systems  for  the  Hard-Real-Time 
Environment,  PhD  dissertation,  Massachusetts  Institute  of  Technology,  May  1983. 

9.  Liu, C. L.; Layland, J. W., "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time
Environment", Journal of the Association for Computing Machinery, Vol. 20, No. 1, January
1973,  pp.  46-61. 

10.  Lehoczky,  J.  P.;  Sha,  L.,  publication  to  appear. 

11.  Wulf,  W.  A.;  Levin,  R.;  Harbison,  S.  P.,  HYDRA/C.mmp:  An  Experimental  Computer  System, 
McGraw-Hill,  Inc.,  1981. 




4.4  Dynamic  System  Reconfiguration 




Dynamic  System  Reconfiguration 

1  Introduction 

In this section, we summarize our work on best effort decision making algorithms for dynamic system
reconfiguration. The objective of dynamic system reconfiguration is to re-organize a system to adapt to a
changing tactical environment or to overcome hardware failures. The study of reconfiguration algorithms is
important to distributed tactical systems that must operate in hostile and ever-changing environments.

Conceptually, the three goals of reconfiguration are minimal disturbance, minimal vulnerability and
maximal feasibility (the ability to reconfigure the current task set without shedding load). The goal of minimal
disturbance requires the reconfiguration algorithm not to disturb the critical modes in functioning processors,
so that the system can carry out its mission without much additional disruption. The goal of maximal
feasibility focuses on the restoration of as many disabled modes as possible. The goal of minimal
vulnerability is introduced to ensure that as a result of reconfiguration the system is made as tolerant of
failures as possible, so that the impacts of future system failures are lessened. Conceptually, the notions of
minimal disturbance and vulnerability help to eliminate the undesirable regions in the solution space, while
the notion of maximal feasibility seeks to have as large a solution space as possible.

We  have  successfully  developed  a  system  of  algorithms  that 

1.  minimize  the  disturbance  to  the  functioning  tasks  in  the  course  of  reconfiguration, 

2.  reduce  the  vulnerability  to  future  system  failure, 

3.  and  operate  in  real-time. 

2  Problem  Complexity  and  Algorithm  Design  Decisions 

In a distributed system, the reconfiguration task belongs to the class of quadratic integer programming
problems. The term "quadratic" refers to the pairwise nature of the communication traffic and the term
"integer" refers to the indivisibility of a process. The determination of the optimal solution to this type of
problem is known to be a special type of NP-complete problem, which, except for very small sizes, "will
remain intractable perpetually" [Karp72]. To appreciate the difficulty, we report the work done by Kaku and
Thompson [Kaku83]. They used one of the best known algorithms coded in Fortran on a DEC-20. It took on
average 31.43 seconds to obtain the optimal solution for an 8 by 8 problem, and 220.91 seconds and 1,305.65
seconds for 9 by 9 and 10 by 10 problems, respectively. The rapid, exponentially increasing computation time
is characteristic of this type of problem. Owing to the very nature of this problem, we had to give up the
search for optimal solutions and develop our own hybrid system of fast algorithms for the reconfiguration
problem.*

3  Basic  Concepts:  Disturbance  and  Vulnerability 

When a failure of a processor or other hardware element occurs, there is an immediate interruption of
various functioning processes either because the processor in which they were running failed or because the
communication paths being used were disrupted. It is important to realize that processes are the fundamental
unit of reconfiguration; however, modes are the fundamental unit of disturbance. A system mode consists of
a collection of processes. If any of those processes fails (because a processor fails or a communication path is
interrupted), the mode is disrupted. Thus, modes are the unit of the system functionality.

One  must  now  reconfigure  the  system  to  overcome  the  failure.  This  can  be  done  by  rerouting  messages, 
relocating processes, shedding load, or all three. There are three distinct aspects to be considered:

1.  The  present  disturbance  incurred  by  the  failure. 

2.  The  additional  disturbance  caused  by  reconfiguration. 

3.  The  vulnerability  of  the  subsequent  reconfigured  system. 

We  next  analyze  each  of  these  aspects.  Let  us  begin  by  quantifying  the  disturbance. 

3.1  Quantification  of  Disturbance 

In a real time command and control environment, a system will require a large number of modes to
function properly. Clearly, some modes will be of greater importance than others. We will assume that it is
possible to attach an index of criticality to each of the modes and that this set of indices is available to the
reconfiguration algorithm. This index is a measure of the importance of one mode relative to another fixed
reference mode. This index can be used as a scoring device by which to measure the amount of disturbance
caused by an interruption of a set of modes. The scoring of disturbance is a very serious issue. It is most
convenient to adopt a linear scoring rule. This rule would operate as follows. If a set of modes ceased to
function, then the total disturbance is taken to be the sum of the individual indices of criticality. This is a
reasonable approach in most cases and is generally very convenient; however, there are instances where it is
inadequate and even misleading. For example, suppose that we consider two modes associated with scanning:
scanning mode 1 and scanning mode 2. Each may have the same index of criticality. The loss of a single
scanning mode is not catastrophic, because there is a second which can fulfill part of the scanning function. If,
however, both scanning modes are lost, the impact is very serious, far greater than the sum of the two index


scores, because the entire scanning function has been lost. This situation is representative of a non-linear
scoring problem. That is, the total score is not simply the sum of the individual scores.

*[Footnote to Section 2; its text is illegible in the source copy.]

An alternative to a linear approach is a utility function approach. We must define a function which takes sets
of disrupted modes and evaluates them numerically. This function can be as general as the situation warrants
and must be constructed by hand for all the possible failure subsets. This is a feasible, but difficult and
exacting task.

We propose to adopt a hybrid, intermediate scoring approach. This approach begins with a linear scoring
rule but the reconfiguration algorithm will take the non-linearities into account. Furthermore, we do this in a
simple and fast algorithmic way. This approach is as follows. We consider each of the modes separately and
define an index of criticality for each. Next, we consider the modes in pairs. For each pair, we determine
whether the simultaneous failure of both would be of profound seriousness. If so, we create an index of joint
criticality and define a pseudo resource of 1 unit for this pair. Both modes are assumed to require 0.6 units
of this pseudo resource, while all other modes require 0 units of this resource. We continue in this fashion
creating critical pairs, indices of joint criticality and pseudo resources. One can continue with triplets of joint
criticality indices and pseudo resources. In this case, each member of a triplet requires 0.4 units of the pseudo
resource, while all other modes require 0 units of it. The result of this approach is that the reconfiguration
algorithm will be able to avoid critical joint disturbance and avoid putting jointly critical modes in the same
processor, which would otherwise create an unnecessarily large future vulnerability.
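
The effect of the 0.6 and 0.4 unit requirements can be checked with a small feasibility test. The sketch assumes that every processor offers one unit of each pseudo resource and that a placement is rejected when the co-resident modes together demand more than that unit; the function and the mode names are hypothetical.

```python
def placement_feasible(modes_on_processor, pseudo_resources, capacity=1.0):
    """Return True if no pseudo resource is over-subscribed on this processor.

    pseudo_resources: list of dicts mapping mode name -> units required;
    modes not listed in a dict require 0 units of that resource.
    """
    for demand in pseudo_resources:
        used = sum(demand.get(mode, 0.0) for mode in modes_on_processor)
        if used > capacity + 1e-9:   # small tolerance for float rounding
            return False
    return True


# One jointly critical pair (0.6 units each) and one triplet (0.4 units each).
scan_pair = {"scan1": 0.6, "scan2": 0.6}
triplet = {"trackA": 0.4, "trackB": 0.4, "trackC": 0.4}
resources = [scan_pair, triplet]

pair_split = placement_feasible(["scan1", "nav"], resources)
pair_together = placement_feasible(["scan1", "scan2"], resources)
two_of_triplet = placement_feasible(["trackA", "trackB"], resources)
whole_triplet = placement_feasible(["trackA", "trackB", "trackC"], resources)
```

With these requirements the two members of a pair can never share a processor (1.2 units), while any two members of a triplet can (0.8 units) but all three cannot (1.2 units).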

The  creation  of  pseudo  resources  addresses  the  non-linear  scoring  problem  and  allows  us  to  keep  a  linear 
scoring  approach.  Notice,  however,  that  each  additional  pseudo  resource  brings  new  constraints  into  the 
reconfiguration  algorithm  and  hence  reduces  reconfiguration  potential.  Consequently,  one  should  use  this 
approach  sparingly. 

3.2  Future  Vulnerability 

When a failure occurs, certain modes will be disrupted and a criticality score can be computed using the
linear rule and joint indices. After examining the impact of the failure, the system must be reconfigured, and
additional disturbance may occur owing to the reconfiguration. The initial part represents sunk cost, the cost
from the failure. The reconfiguration part can only add to the cost. Therefore, there is a powerful incentive to
incur the smallest possible additional cost in reconfiguration. This is, however, shortsighted. The sunk costs
which seem to be unavoidable were actually the result of the previous reconfiguration decision. The
reconfiguration from the previous failure left the system in a state which led to the current sunk cost. The current
reconfiguration will create a future system vulnerability. Clearly, we would like this vulnerability to be kept as
low as possible.

Now we quantify this concept. Fortunately, it is quite simple to quantify future system vulnerability using
an expected value approach. We focus on a fixed processor type (e.g., UYK-44). Consider any potential
configuration of processes in processors. Each processor will support one or more modes. If that processor
were to fail, then the sunk cost disturbance score would be the sum of the individual mode index scores plus
the joint index scores of any jointly critical sets that are co-resident in the processor. Thus, we can compute
the resulting sunk cost disturbance associated with each single processor failure. We assume that each of these
processors is equally likely to fail; thus the future system vulnerability is the average of the individual total
processor scores. Any configuration can be evaluated in this fashion.
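
The expected value computation can be written down directly. The sketch below assumes, for simplicity, that each mode resides entirely in one processor and that a joint index is charged only when every member of a jointly critical set is co-resident in the failed processor; the function names, mode names and index values are hypothetical.

```python
def processor_score(modes, criticality, joint_sets):
    """Sunk-cost disturbance incurred if the processor hosting `modes` fails."""
    hosted = set(modes)
    score = sum(criticality[m] for m in hosted)
    # Add the joint index of every jointly critical set fully hosted here.
    score += sum(index for members, index in joint_sets
                 if set(members) <= hosted)
    return score


def future_vulnerability(configuration, criticality, joint_sets):
    """Average processor score, each processor being equally likely to fail."""
    scores = [processor_score(modes, criticality, joint_sets)
              for modes in configuration.values()]
    return sum(scores) / len(scores)


criticality = {"scan1": 4, "scan2": 4, "nav": 2, "log": 1}
joint_sets = [(("scan1", "scan2"), 10)]   # joint criticality index of the pair

together = {"P1": ["scan1", "scan2"], "P2": ["nav", "log"]}
apart = {"P1": ["scan1", "nav"], "P2": ["scan2", "log"]}

vulnerability_together = future_vulnerability(together, criticality, joint_sets)
vulnerability_apart = future_vulnerability(apart, criticality, joint_sets)
```

In this example, separating the two scanning modes roughly halves the future vulnerability, since no single processor failure can then trigger the joint index.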

3.3  Trade-offs 

We have identified three aspects of the cost of a particular failure: sunk costs from a failure, the present
additional cost from the reconfiguration activity and the average sunk cost of the next failure (the future
vulnerability). We focus entirely on a single processor type, since processes in a failed processor must be
moved into the same processor type. Once a failure occurs, one can compute the sunk costs by noting all the
modes that have been disturbed and adding all the individual criticality scores and joint criticality scores. We
label this score S_c, for current score.

We  now  consider  a  particular  feasible  reconfiguration.  We  identify  the  modes  disturbed  by  failure  and  by 
reconfiguration,  and  calculate  the  total  disturbance  score  which  results.  This  total  disturbance  score  is 
denoted  as  S^.  The  additional  disturbance  from  reconfiguration  for  this  solution  is  -  S^..  The  new 

configuration  has  a  future  vulnerability  associated  with  it,  the  average  sunk  costs  associated  with  the  failure  of 
each  of  the  processors  of  this  type.  We  label  this  for  future  cost.  In  summary,  each  feasible  reconfiguration 
has  two  numbers  associated  with  it,  and  Sf. 

There are two issues which need to be addressed. Ideally, one would like to find a reconfiguration in which
both the reconfiguration disturbance S_r and the future vulnerability S_f are small. However, these two goals
are often in conflict. In addition, we have discussed a typical reconfiguration. The number of such feasible
reconfigurations could be enormous, far greater than could be examined in real time. It follows that we must
define heuristics which will guide us close to a solution with small values of S_r and S_f. Generally, we prefer
to first specify S_r so that the disturbance owing to reconfiguration is bounded. We then try to find a
reconfiguration with a small future vulnerability S_f. The determination of an acceptable S_r requires some
judgement and certainly depends on the current mission scenario. We treat the threshold limit on S_r as an
input parameter to the algorithmic system. Generally, if a reconfiguration can be accomplished without further
disturbance, for example, by using spares, it will be automatically carried out. If some disturbance is
necessary, then input from the operator is sought. The operator, informed of the current disturbance, can
either specify the set of modes which can be disturbed during reconfiguration or the maximal allowable value
for S_r.




4  Algorithm  Development  Approach 

Having  discussed  the  principles  underlying  the  issues  of  disturbance  and  vulnerability,  we  now  develop  our 
algorithms.  We  first  describe  our  overall  two  phase  approach  as  a  basis  for  the  design  of  fast  algorithms.  We 
then  describe  the  implementation  of  the  minimization  of  disturbance  and  of  vulnerability  as  well  as  the 
maximization  of  the  feasibility  in  the  context  of  this  two  phase  approach. 

4.1  The  Two  Phase  Search  Approach 

We  have  pointed  out  that  due  to  the  communication  constraints,  our  reconfiguration  problem  is  of  the  type 
of  quadratic  integer  assignment  problems  whose  optimal  solutions  are  generally  impossible  to  obtain  unless  the 
problem  size  is  extremely  small.  Thus,  we  must  develop  fast  algorithms  that  focus  their  search  in  the 
promising  area  of  the  solution  space. 

The constraints of our reconfiguration problem can be divided into two classes: linear constraints such as
CPU cycles and memory units, and the quadratic constraints, the bandwidth of the buses. Since a bus is one of
the most reliable elements in the system, it is less likely to fail as compared with processor elements. This
indicates that a promising area in the solution space is the linear subspace. Based upon this observation, we
divide our search of the solution space into two phases: the linear phase and the quadratic phase. In the
linear phase, we try to satisfy all the linear constraints. Once the linear constraints are satisfied, we have a
candidate solution which will be checked to see if it satisfies the bus constraints. If it does, then we have found
a solution. If not, we enter the quadratic phase to further minimize the communication traffic on the bus.

4.2  Linear  Phase:  Satisfying  Processor  Constraints 

In the linear phase, our reconfiguration problem is of the type known as integer assignment problems. The
constraints are CPU utilization, memory utilization and communication port utilization. Note that the port
utilization is additive. There are generally three possible formulations of this problem.

1. Formulate the problem as an optimal zero-one integer programming problem and use the branch
and bound method.

2. Formulate the problem as the linear programming problem of the "cutting stock" type
[Chvatal83].

3. Formulate the problem as a multi-dimensional bin packing problem [Coffman83].

Each of these formulations can be used. However, we prefer the bin packing formulation because of the
existence of the first fit decreasing (FFD) algorithm, which is known to be very fast and often produces either
optimal or near optimal results [Frederickson80, Bentley83]. The development of our algorithm for the
search of the linear subspace will be based upon the FFD algorithm.

In the single dimension case, the FFD algorithm is to first order the items in decreasing size. We then try to
put the largest item into the first bin. Next, we try to put the next largest one into the first bin and so on until
the first bin cannot be filled any further. We then use the second bin and repeat the same algorithm. If the bin
sizes are different, we order the bins in increasing order according to their sizes. The idea of FFD is to get the
most difficult one done first: trying to put the largest item into the smallest bin first. When the problem is
multi-dimensional, we use the largest of an item's requirements to order the items and the smallest of a bin's
capacities to order the bins. For example, if a process (item) needs 20% CPU and 10% memory, then its scalar
measure is 0.2. If a processor (bin) has 70% CPU available and 30% memory available for the reconfiguration
task, then its scalar measure is 0.3.
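
The ordering rules above can be illustrated with a short Python sketch. It is a simplified, item-centered form of first fit decreasing (each item goes into the first bin in the ordered list that still has room), which is an assumption about how the description would be realized; the function name and the sample processes and processors are hypothetical. Resource amounts are integer percentages.

```python
def first_fit_decreasing(items, bins):
    """Assign items to bins; returns {item: bin} or None if some item cannot fit.

    items: {item_name: {resource: requirement}}
    bins:  {bin_name:  {resource: available capacity}}
    """
    remaining = {b: dict(cap) for b, cap in bins.items()}
    # Items: decreasing order of their largest requirement (scalar measure).
    item_order = sorted(items, key=lambda i: max(items[i].values()), reverse=True)
    # Bins: increasing order of their smallest capacity (scalar measure).
    bin_order = sorted(bins, key=lambda b: min(bins[b].values()))

    assignment = {}
    for item in item_order:
        need = items[item]
        for b in bin_order:
            # The item fits only if every dimension has enough room.
            if all(need[r] <= remaining[b][r] for r in need):
                for r in need:
                    remaining[b][r] -= need[r]
                assignment[item] = b
                break
        else:
            return None  # no bin can hold this item
    return assignment


# Hypothetical processes (items) and processors (bins), in percent.
processes = {
    "p1": {"cpu": 20, "mem": 10},
    "p2": {"cpu": 50, "mem": 20},
    "p3": {"cpu": 30, "mem": 30},
}
processors = {
    "A": {"cpu": 70, "mem": 30},   # scalar measure 30
    "B": {"cpu": 60, "mem": 60},   # scalar measure 60
}
placement = first_fit_decreasing(processes, processors)
too_big = first_fit_decreasing({"p9": {"cpu": 90, "mem": 10}}, processors)
```

Here p2 (measure 50) is placed first, into A; p3 no longer fits in A and goes to B; p1 then fills the remaining room in A.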

4.3  Quadratic  Phase:  Minimizing  Communication  Traffic 

When the bus communication constraints are violated, we enter the quadratic phase. In this phase, our
objective is to minimize the traffic flow on the bus. Our quadratic integer assignment problem can be
attacked with the following classes of algorithms:

1. Algorithms based upon the Branch-and-Bound method [Hillier80].

2. Algorithms based upon the estimation method [Graves70].

3. Algorithms based upon the cluster idea used in real-time task allocation [Chu80].

In a real-time environment, the only feasible algorithms are the clustering algorithms because both the
Branch-and-Bound and the estimation methods are known to be slow. The clustering algorithms are based upon
the following observation. If we first group the heavily communicating tasks together before assigning them to
the processors, then the solution space is significantly reduced. One can argue that this reduced space is
promising because many heavily communicating processes are already bound together and they no longer
need to communicate over buses.

4.4  Controlling  Disturbance 

We have mentioned that we are in favor of keeping the reconfiguration disturbance S_r to a low level while
trying to produce a reconfiguration that is low in future vulnerability. We assume that the default value of S_r
is zero. That is, without the operator's explicit permission, the only processes that are allowed to be relocated
by the reconfiguration system are those associated with the modes disabled by the hardware failures. The
reconfiguration system will always attempt to restore disabled modes at zero disturbance and report the
results to the operator. If all disabled modes cannot be restored at a zero score, the operator has the option
of explicitly naming the set of modes that can be relocated or of giving a level of allowable reconfiguration
disturbance S_r. If S_r is given, then the reconfiguration system will first try to enlarge the solution space as
much as possible, while keeping the disturbance level below S_r.

An example may help to illustrate these ideas. Suppose that we consider five modes, two associated with
scanning and three with fire control. The scanning modes have individual criticality score 0.1 and joint
criticality score 0.8. The fire control modes have individual criticality score 0.15 and joint criticality score 0.9.
In reconfiguration, suppose that one of the scanning modes is the only mode in the failed processor and that
the operator specifies S_r = 0.6. If we try to relocate the functioning scanning mode, then we have a
disturbance score given by 0.1 + 0.8 = 0.9. The 0.1 is the individual criticality score for disturbing the
functioning scanning mode. The 0.8 is the joint criticality score for disturbing both scanning modes, one of
which was disturbed by the processor failure. Since 0.9 > 0.6, we must not disturb the only functioning
scanning mode in the reconfiguration activity. If we relocate two fire control modes, the disturbance score is
0.15 + 0.15 = 0.3. If we relocate all three fire control modes, the disturbance score is 0.15 + 0.15 + 0.15 +
0.9 = 1.35. In this example, we can only relocate the scanning mode disabled by failure plus two functioning
fire control modes.
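
The arithmetic of this example can be checked with a small Python sketch. The helper names and mode labels are hypothetical; the scores are those given in the example, and the sketch assumes that a joint score is incurred only when every mode in the jointly critical set is disturbed.

```python
def disturbance(modes, individual, joint):
    """Total disturbance score of a set of disturbed modes."""
    modes = set(modes)
    score = sum(individual[m] for m in modes)
    # A joint score applies only if the whole jointly critical set is disturbed.
    score += sum(s for group, s in joint.items() if group <= modes)
    return score


def additional_disturbance(already_disturbed, to_relocate, individual, joint):
    """Extra score caused by relocating functioning modes after a failure."""
    before = disturbance(already_disturbed, individual, joint)
    after = disturbance(set(already_disturbed) | set(to_relocate), individual, joint)
    return after - before


individual = {"scan1": 0.1, "scan2": 0.1,
              "fire1": 0.15, "fire2": 0.15, "fire3": 0.15}
joint = {frozenset({"scan1", "scan2"}): 0.8,
         frozenset({"fire1", "fire2", "fire3"}): 0.9}

failed = {"scan1"}   # the scanning mode on the failed processor
limit = 0.6          # operator-specified allowable disturbance

options = {
    "other scanning mode": {"scan2"},
    "two fire control modes": {"fire1", "fire2"},
    "three fire control modes": {"fire1", "fire2", "fire3"},
}
extra = {name: additional_disturbance(failed, modes, individual, joint)
         for name, modes in options.items()}
allowed = {name: score <= limit for name, score in extra.items()}
```

The three options evaluate to roughly 0.9, 0.3 and 1.35, so only the option that relocates two fire control modes stays within the limit of 0.6.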

Generally, once the allowable disturbance level for reconfiguration is given, we need an algorithm to
identify the set of modes that can be moved without exceeding the specified value of S_r. It is possible that
this computation has already been carried out off-line and the results are stored in the database. Thus, for a
given S_r we would only need to look up the movable modes in the database. We also provide an on-line
algorithm to carry out this activity. This algorithm is a modified form of the FFD bin packing algorithm called
the reverse FFD algorithm, which takes a set of failed modes and identifies a set of functioning modes that can
be disturbed. The identified set satisfies the disturbance constraint and is chosen to offer a high probability of
finding a feasible reconfiguration. This algorithm is described in Section 5.1.

4.5  Controlling  Vulnerability 

We have developed our approach to keep the reconfiguration disturbance below a permitted level S_r. Given
that this constraint is satisfied, we now must minimize the future vulnerability. As discussed before, we cannot
hope to find the minimal future vulnerability configuration in real time. However, we can take advantage of
the characteristics of the SubACS system and develop an approach that produces nearly optimal results.

When a processor fails, all the modes relying on the processes running on this processor are interrupted.
Thus, to minimize future vulnerability, a key principle is to avoid putting multiple modes on a single
processor, especially jointly critical modes. Another important aspect in minimizing future vulnerability is to
try to pack all the processes of a mode into as few processors as possible, because when a mode is supported
by several processors, the failure of any of these processors interrupts the mode. Both of these aspects can be
accomplished by modifying the multi-dimensional FFD algorithm in the linear phase of the reconfiguration
procedure and the clustering algorithm in the quadratic phase.

In the linear phase, to avoid putting multiple critical modes into a single processor, we associate pseudo
resources with each processor. Each of the resources corresponds to a joint critical index as discussed
before. In the example we discussed earlier, the two scanning modes have a joint criticality score of 0.8. To
prevent the processes of both scanning modes being put into a single processor, we associate one unit of
pseudo scanning resource with each of the processors that can support scanning modes. In addition, we
specify that a scanning mode requires 0.6 unit of the pseudo scanning resource, while non-scanning
modes require none of this pseudo scanning resource. Thus, a processor can only support the processes of
one of the two scanning modes, but not both. To put the processes of a mode into as few processors as
possible, we first run the multi-dimensional FFD algorithm for the most significant mode with respect to the
processors with the most available capacities. This gives the best chance for the FFD algorithm to pack the most
significant mode into the least number of processors. After packing the most significant mode, we repeat this
process for the second most important mode and so on.

In  the  quadratic  phase,  the  modification  of  the  clustering  algorithm  is  similar  to  the  modification  of  the 
FFD  algorithm.  To  avoid  putting  processes  belonging  to  a  set  of  jointly  critical  modes  into  a  single  processor, 
we  require  the  clustering  algorithm  to  observe  the  pseudo  resource  constraints  associated  with  each  processor. 
To promote the clustering of processes belonging to the same mode, we run the clustering algorithm on a mode
by mode basis. That is, we examine the communication graph of each mode and try to cluster those processes
with heavy inter-process communication first. Since processes belonging to the same mode co-operatively
carry out a task, they tend to communicate with each other more than with processes of other modes. Thus
the mode by mode clustering is likely to produce results nearly as good as the results of global clustering.

5  The  New  Algorithmic  System 

In  this  section,  we  first  outline  in  detail  the  reverse  FFD  algorithm,  the  modified  FFD  and  clustering 
algorithms.  Next,  we  present  our  overall  procedure  for  carrying  out  the  reconfiguration. 

5.1  Maximizing  Feasibility:  The  Reverse  FFD  Algorithm 

The Reverse FFD algorithm is used to identify the movable modes with respect to the given allowable
disturbance level S_r. That is, the algorithm takes the current configuration and the set of failed modes and
produces a set of functioning modes which can be disturbed within the disturbance constraint S_r.

We  now  list  the  steps  of  this  algorithm. 

1. If the allowable disturbance S_r is zero, then list all the modes that have been disabled by the failure as
movable modes and exit this procedure.

2. If S_r is non-zero, then create a pseudo bin whose size is S_t = S_r + S_c, where S_c is the current
disturbance due to the failure.

3. For each mode, we compute the average process resource utilization for each resource type. For
example, if mode A has 3 processes which use 0.1, 0.2, and 0.3 units of CPU and 0.4, 0.5, and 0.6 units of
memory respectively, then the average CPU utilization is (0.1 + 0.2 + 0.3)/3 = 0.2
units and the average memory utilization is 0.5 units.

4. For each of the failed modes, we pair it with each of the functioning modes and check their joint
compatibility. If the sum of the average utilization of each of the resource types is less than one,
then the two modes are said to be compatible. For example, if mode A has average CPU and
memory utilization 0.3 and 0.5, while these averages are 0.4 and 0.3 for mode B, then mode A and
mode B are said to be compatible. This is because the total utilizations are 0.7 and 0.8, which are
both less than 1. If a functioning mode is compatible with any of the failed modes, then it is said
to be a candidate movable mode. The idea of compatibility is that if a failed mode is CPU
intensive, then we want to match it with a mode which uses little CPU, so that they can be packed
together later. We identify all the candidate movable modes.

5. For each candidate movable mode M_j, we compute its total resource supply R_j by summing up all
the resources it utilizes in percentage of total units.

6. We now try to fill the pseudo bin by first putting all the failed modes into it. We then repeat the
following steps until the list of candidate modes is emptied.

a. For each of the candidate movable modes M_j, we compute the additional disturbance score
D_j which would arise if mode M_j is disturbed*.

b. For each of the candidate movable modes M_j, we compute the merit score R_j/D_j. The merit
score is a measure of the amount of resource provided by a mode per unit disturbance
incurred.

c. We try to put the one with maximal merit score into the pseudo bin. Call this mode the
candidate bin resident and delete it from the candidate movable mode list.

d. Add up the total disturbance scores of the bin resident modes and this candidate bin
resident. If this total score is less than the pseudo bin size S_t, then the candidate becomes a
resident of the pseudo bin.

7. Report all the modes in the pseudo bin as movable modes to the operator for approval and
possible alteration.
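
The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the report's code: the function names, the data layout, and the per-process utilizations are assumptions, while the criticality scores are those of the example in Section 4.4. The additional disturbance D_j is taken as the change in total score when a candidate joins the bin.

```python
def disturbance(modes, individual, joint):
    """Total disturbance score of a set of disturbed modes."""
    score = sum(individual[m] for m in modes)
    return score + sum(s for group, s in joint.items() if group <= modes)


def average_utilization(processes):
    """Step 3: average utilization per resource type over a mode's processes."""
    return {r: sum(p[r] for p in processes) / len(processes) for r in processes[0]}


def compatible(avg_a, avg_b):
    """Step 4: summed averages must stay below one for every resource type."""
    return all(avg_a[r] + avg_b[r] < 1.0 for r in avg_a)


def reverse_ffd(failed, functioning, s_r, individual, joint, processes):
    failed = set(failed)
    if s_r == 0:                                     # step 1
        return failed
    bin_size = s_r + disturbance(failed, individual, joint)      # step 2
    avg = {m: average_utilization(p) for m, p in processes.items()}
    candidates = {m for m in functioning
                  if any(compatible(avg[m], avg[f]) for f in failed)}
    supply = {m: sum(sum(p.values()) for p in processes[m])      # step 5
              for m in candidates}
    resident = set(failed)                           # step 6
    while candidates:
        base = disturbance(resident, individual, joint)

        def merit(m):                                # steps 6a and 6b
            extra = disturbance(resident | {m}, individual, joint) - base
            return supply[m] / extra if extra > 0 else float("inf")

        best = max(sorted(candidates), key=merit)    # step 6c
        candidates.remove(best)
        if disturbance(resident | {best}, individual, joint) < bin_size:  # 6d
            resident.add(best)
    return resident                                  # step 7: report these


individual = {"scan1": 0.1, "scan2": 0.1,
              "fire1": 0.15, "fire2": 0.15, "fire3": 0.15}
joint = {frozenset({"scan1", "scan2"}): 0.8,
         frozenset({"fire1", "fire2", "fire3"}): 0.9}
processes = {  # hypothetical per-process utilizations, in units of one processor
    "scan1": [{"cpu": 0.3, "mem": 0.2}, {"cpu": 0.3, "mem": 0.2}],
    "scan2": [{"cpu": 0.3, "mem": 0.2}, {"cpu": 0.3, "mem": 0.2}],
    "fire1": [{"cpu": 0.2, "mem": 0.1}],
    "fire2": [{"cpu": 0.2, "mem": 0.2}],
    "fire3": [{"cpu": 0.1, "mem": 0.1}],
}
functioning = ["scan2", "fire1", "fire2", "fire3"]
movable = reverse_ffd({"scan1"}, functioning, 0.6, individual, joint, processes)
movable_at_zero = reverse_ffd({"scan1"}, functioning, 0, individual, joint, processes)
```

With these inputs the sketch selects the failed scanning mode plus two fire control modes, in line with the conclusion of the earlier worked example.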


5.2  The  New  Multi-dimensional  FFD  algorithm 

The steps to carry out the new multi-dimensional FFD algorithm are as follows.

1.  Augment  each  processor  with  a  set  of  pseudo  joint  criticality  resources,  each  of  which  corresponds 
to  a  joint  criticality  index. 

2.  List  all  the  movable  modes  in  decreasing  order  according  to  their  linear  criticality  score. 

3. List all the processors in decreasing order according to their minimal capacities. For example, if
processor A has 10% available CPU cycles and 20% available memory, then the minimal capacity
is 10%.

4. Take the first mode in the list and repeat the following until either the list is emptied or the
reconfiguration fails.

a. List all the processes in the mode in decreasing order according to their maximal real
resource requirement such as CPU, memory, ports etc. For example, if a process needs 20%
CPU and 10% memory, its maximal resource requirement is 0.2.


* [Footnote text illegible in the source copy.]




b. Take the first process in the process list and repeat the following until either the list is
emptied or reconfiguration fails.

i. Try to put the process in the first processor in the ordered processor list.

ii. Check all the constraints in all the dimensions, real or pseudo.

iii. If all the constraints are satisfied, then this process becomes the resident of the
processor. In this case, delete the process from the list.

iv. If a constraint is violated, try the next processor in the ordered list. If the process
cannot be fit into any processor, then report to the operator that the reconfiguration fails.

c.  If  the  list  of  processes  is  emptied,  then  delete  the  associated  mode  from  the  list  of  movable 
modes. 
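
A compact Python sketch of these steps follows. It is one possible reading, with hypothetical names and data: real resources are integer percentages, each processor is given one unit of every pseudo resource, and a mode's pseudo requirement is charged once per processor that hosts any of its processes (an assumption, since the text does not say how the charge is applied per process).

```python
def modified_ffd(modes, processors, pseudo_units=1.0):
    """Place every process of every movable mode; return {process: processor} or None."""
    real_left = {p: dict(cap) for p, cap in processors.items()}
    # Step 1: augment each processor with the pseudo joint criticality resources.
    pseudo_names = {k for m in modes for k in m["pseudo"]}
    pseudo_left = {p: {k: pseudo_units for k in pseudo_names} for p in processors}
    hosted = {p: set() for p in processors}          # modes already on a processor

    # Steps 2 and 3: order modes by criticality, processors by minimal capacity.
    mode_order = sorted(modes, key=lambda m: m["criticality"], reverse=True)
    proc_order = sorted(processors, key=lambda p: min(processors[p].values()),
                        reverse=True)

    assignment = {}
    for mode in mode_order:                          # step 4
        procs = mode["processes"]
        # Step 4a: processes in decreasing order of maximal requirement.
        for name in sorted(procs, key=lambda n: max(procs[n].values()), reverse=True):
            need = procs[name]
            for p in proc_order:                     # step 4b
                real_ok = all(need[r] <= real_left[p][r] for r in need)
                new_here = mode["name"] not in hosted[p]
                pseudo_ok = (not new_here) or all(
                    q <= pseudo_left[p][k] for k, q in mode["pseudo"].items())
                if real_ok and pseudo_ok:
                    for r in need:
                        real_left[p][r] -= need[r]
                    if new_here:
                        for k, q in mode["pseudo"].items():
                            pseudo_left[p][k] -= q
                        hosted[p].add(mode["name"])
                    assignment[name] = p
                    break
            else:
                return None                          # reconfiguration fails
    return assignment


modes = [
    {"name": "scanA", "criticality": 0.3, "pseudo": {"scan": 0.6},
     "processes": {"a1": {"cpu": 30, "mem": 20}, "a2": {"cpu": 20, "mem": 20}}},
    {"name": "scanB", "criticality": 0.2, "pseudo": {"scan": 0.6},
     "processes": {"b1": {"cpu": 25, "mem": 20}}},
]
processors = {"P1": {"cpu": 80, "mem": 80}, "P2": {"cpu": 60, "mem": 70}}
placement = modified_ffd(modes, processors)
single = modified_ffd(modes, {"P1": {"cpu": 80, "mem": 80}})
```

In this example b1 would fit on P1 as far as CPU and memory are concerned, but the pseudo scanning resource left on P1 (0.4 unit) is less than the 0.6 unit that scanB requires, so b1 is placed on P2. With only P1 available the sketch reports failure.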


5.3  The  New  Clustering  Algorithm 

The  steps  of  the  new  clustering  algorithm  are  as  follows. 

1.  List  all  the  movable  modes  in  decreasing  order  according  to  their  linear  criticality  score. 

2. List all the processors in decreasing order according to their minimal available capacities as defined
in the previous section.

3.  Take  the  first  mode  in  the  list  and  repeat  the  following  until  either  the  list  is  emptied  or  the 
reconfiguration  fails. 

a. Draw the communication graph of all the processes in the mode.

b. Identify the maximal arc.

c. Put one of the two nodes joined by the arc on a merging list.

d. In the graph, mark the node on the merging list as the merged node.

e. Repeat the following until all the nodes in the graph are put into the list.

i. In the graph, identify the node which communicates most heavily with the merged
node.

ii. Label the identified node as the candidate node.

iii. List the candidate node in the merging list.

iv. In the graph, merge the candidate node with the merged node.

f. Take the first processor in the processor list. Repeat the following until the merging list is
emptied.

i. Take the first node in the merging list, delete it from the list and try to fit it in the
processor.

ii. Check all the constraints, real or pseudo.

iii. If successful, then this node is a resident of the processor.

iv. If a node becomes a resident of a processor, delete it from the communication graph.

g. Delete the processor from the processor list.

h. If the processor list is emptied but the communication graph is not, then report that the
reconfiguration fails.

i. If the processor list is not emptied but the communication graph is, then delete the mode
from the movable mode list.
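
For a single mode, steps a through i can be sketched in Python as below. The function names, the traffic figures, and the capacities are hypothetical. Two interpretive assumptions are made: the traffic between a node and the merged node is the sum of its traffic with every node already merged, and a processor is filled with nodes in merging-list order until one does not fit, at which point the next processor is tried. Pseudo resource checks are omitted for brevity.

```python
def build_merging_list(nodes, traffic):
    """Steps a-e: order nodes so heavily communicating ones are adjacent.

    traffic: {frozenset({a, b}): volume} for each arc of the communication graph.
    """
    nodes = sorted(nodes)
    if not traffic:
        return nodes

    def weight(a, b):
        return traffic.get(frozenset((a, b)), 0)

    heaviest_arc = max(traffic, key=traffic.get)     # step b: maximal arc
    first = min(heaviest_arc)                        # step c: one of its two nodes
    order, merged = [first], {first}                 # step d
    remaining = [n for n in nodes if n != first]
    while remaining:                                 # step e
        candidate = max(remaining,
                        key=lambda n: sum(weight(n, m) for m in merged))
        order.append(candidate)
        merged.add(candidate)                        # merge into the merged node
        remaining.remove(candidate)
    return order


def place_cluster(order, requirements, processors):
    """Steps f-i: fill processors in list order; return {node: processor} or None."""
    pending = list(order)
    assignment = {}
    for name, capacity in processors:
        left = dict(capacity)
        while pending:
            need = requirements[pending[0]]
            if not all(need[r] <= left[r] for r in need):
                break                                # try the next processor
            for r in need:
                left[r] -= need[r]
            assignment[pending.pop(0)] = name
        if not pending:
            return assignment                        # graph emptied: mode placed
    return None                                      # processors exhausted: failure


traffic = {frozenset(("a", "b")): 10, frozenset(("b", "c")): 8,
           frozenset(("c", "d")): 2, frozenset(("a", "d")): 1}
requirements = {n: {"cpu": 30} for n in "abcd"}
order = build_merging_list("abcd", traffic)
placement = place_cluster(order, requirements,
                          [("P1", {"cpu": 70}), ("P2", {"cpu": 60})])
no_room = place_cluster(order, requirements, [("P1", {"cpu": 70})])
```

In this example the heaviest arc (a, b) ends up inside one processor, so the traffic that must cross the bus is limited to the lighter arcs.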


6  The  Reconfiguration  Procedure 

In  this  section,  we  first  list  the  reconfiguration  routine.  Next,  we  list  the  overall  reconfiguration  procedure. 


6.1  The  Reconfiguration  Routine 

We  list  the  reconfiguration  routine. 

1. Run the reverse FFD algorithm to obtain the movable mode list.

2. If it is a processor failure, then run the new multi-dimensional FFD algorithm. If this FFD
algorithm fails, then report that the reconfiguration fails. In addition, report the modes that are
still disabled and exit this routine.

3. Check the communication constraints.

4. If the communication constraints are satisfied, report to the operator that the reconfiguration is
successful, and exit this routine.

5. Otherwise, run the clustering algorithm. If successful, report to the operator that the
reconfiguration is successful, and exit this routine. If unsuccessful, report that the reconfiguration fails.
In addition, report the modes that are disabled.
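
The control flow of this routine can be sketched in Python with the individual algorithms passed in as functions. Everything here is a hypothetical skeleton: the parameter names, the result dictionary, and the stand-in lambdas are assumptions used only to show the order of the steps.

```python
def reconfiguration_routine(failure_kind, current, find_movable,
                            linear_phase, bus_ok, quadratic_phase):
    """Run steps 1-5; each phase function returns an assignment or None."""
    movable = find_movable()                              # step 1: reverse FFD
    candidate = current
    if failure_kind == "processor":                       # step 2: linear phase
        candidate = linear_phase(movable)
        if candidate is None:
            return {"status": "failed", "phase": "linear", "assignment": None}
    if bus_ok(candidate):                                 # steps 3 and 4
        return {"status": "ok", "phase": "linear", "assignment": candidate}
    candidate = quadratic_phase(movable)                  # step 5: clustering
    if candidate is None:
        return {"status": "failed", "phase": "quadratic", "assignment": None}
    return {"status": "ok", "phase": "quadratic", "assignment": candidate}


# Stand-in phase functions for illustration only.
result = reconfiguration_routine(
    failure_kind="processor",
    current={"p1": "A"},
    find_movable=lambda: ["scan1"],
    linear_phase=lambda movable: {"p1": "B"},
    bus_ok=lambda assignment: False,          # pretend the bus is overloaded
    quadratic_phase=lambda movable: {"p1": "C"},
)
failed = reconfiguration_routine(
    failure_kind="processor",
    current={"p1": "A"},
    find_movable=lambda: ["scan1"],
    linear_phase=lambda movable: None,        # linear phase finds no placement
    bus_ok=lambda assignment: True,
    quadratic_phase=lambda movable: None,
)
```

Passing the phases as parameters keeps the skeleton independent of how the reverse FFD, modified FFD, and clustering algorithms are actually implemented.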


6.2  The  Overall  Reconfiguration  Procedure 

When a failure occurs, do the following:

1. Report to the operator the modes disabled by the failure.

2. Try to use spares if there are any. If this is successful, report to the operator that the
reconfiguration is successful.

3. If this fails, run the reconfiguration routine at zero level disturbance and report the result to the
operator.

4. Let the operator determine if there is any mode that can be shed and what should be the
allowable disturbance threshold level, or the set of the movable modes.

5. Run the reconfiguration routine accordingly and report the result.




7  Conclusion 

Dynamic  system  reconfiguration  algorithms  are  important  to  the  survivability  of  a  tactical  decision  system 
that must operate in hostile and ever changing environments. We have successfully developed a system of
algorithms  that 

1.  minimize  the  disturbance  to  the  functioning  tasks  in  the  course  of  reconfiguration, 

2.  reduce  the  vulnerability  to  future  system  failures, 

3.  and  operate  in  real-time. 




Algorithmic  Processor  Scheduler 


E. Douglas Jensen, John P. Lehoczky
Martin  S.  McKendry,  Lui  Sha,  and  Samuel  Shipman 




1 Processor Scheduling

In the course of investigating real-time operating systems, we have become aware of a problem in many
existing real-time executives that can have a profound impact on the availability, maintainability, and
extensibility of the system. The problem is that in these systems there is essentially no real-time system-level
support for the management of time in processors in order to meet hard real-time deadlines.

Building systems that can perform demanding tasks with hard real-time deadlines is very difficult. This is
generally recognized by people working in the field. For example, A. K. Mok [Mok 83] described the process
of building such a system as follows:

Current practice in designing systems for the hard real-time environment is rather ad hoc.
There are few tools to verify that timing constraint specifications can invariably be satisfied. In
fact, systems are often built with little provision for guaranteeing that stringent timing
specifications can be met. Performance evaluation is accomplished either by stochastic simulation or by
actual measurements on prototype testbeds, and system performance is improved by fine tuning
certain system parameters. When specifications cannot be met, structural modifications may
become necessary so that at least the major performance objectives can be achieved. While this
iterative approach seems to suffice for building the less critical data processing systems, it is not
suitable for systems which must function in a hard real-time environment. Unless the interactions
among different components of a system are well understood and taken into account in the design
process, there is no easy way to simultaneously satisfy multiple stringent timing constraints by fine
tuning. Furthermore, there is no absolute guarantee that a timing constraint will invariably be met
by the use of stochastic simulation or by taking performance measurements for a limited set of
load conditions. For this reason, most current systems are better categorized as soft real-time in
that they lack hard guarantees on vital performance characteristics.


One reason this approach has been taken is that the theoretical results necessary to do better are not widely
known, and in some cases are new. In fact, many aspects of the problem are active areas of research.
Furthermore, designers of real-time operating systems have traditionally been content to provide a set of
primitive operations with which to perform scheduling, and leave the problem of actually building a
scheduling structure to the applications programmers, with predictably ad hoc results.


Given  the  tools  they  have  had  to  work  with,  programmers  have  done  surprisingly  well.  But  those  tools  are 
inadequate  for  dealing  with  demanding  hard  real-time  environments.  In  complex  distributed  systems,  the 
need  is  urgent  for  a  unified  system-level  scheduling  strategy. 


In  this  report,  we  first  discuss  the  problems  with  the  existing  approach  and  point  out  the  important  benefits 
to be gained from an algorithmic scheduler. We then give a tutorial on the basics of scheduling theory and
issues  to  be  addressed  in  developing  an  algorithmic  scheduler.  Further  work  on  the  problem  is  presented  in 
subsequent  reports  in  this  volume. 




1.1  Problems  with  Existing  Approaches 

In this section, we describe a hypothetical scheduling facility that is similar to most existing ones in real-time
executives. Generally, a scheduling facility supports the notion of programs which are composed of tasks.
Tasks are the schedulable entities. Tasks within a single program can communicate, synchronize, and alter
the priority of other tasks using system services. However, no task can directly affect a task in a different
program except by using the interprocess communication facilities. The set of tasks running on a particular
processor is termed a processor load.

Tasks have two possible dispatching states: ready to run and blocked. Each task also has a numerical priority.
When the dispatcher runs, it dispatches the highest-priority task that is ready to run. A new task is run when
the currently running task becomes blocked (typically by waiting for some system service) or when some
higher-priority task becomes ready to run (e.g., upon I/O completion). There are system calls to create and
destroy tasks, and to alter the priority of any task.
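
The dispatching rule of this hypothetical facility can be sketched in Python as follows. The class, the task names, and the convention that a larger number means a higher priority are assumptions made for illustration; no real executive's interface is implied.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int        # assumption: larger number means higher priority
    ready: bool = True   # False means the task is blocked


def dispatch(tasks):
    """Return the highest-priority task that is ready to run, or None."""
    ready = [t for t in tasks if t.ready]
    if not ready:
        return None
    return max(ready, key=lambda t: t.priority)


tasks = [Task("logger", 1), Task("tracker", 5), Task("io_handler", 9, ready=False)]

first = dispatch(tasks)      # io_handler is blocked, so tracker runs
tasks[2].ready = True        # e.g., I/O completion makes io_handler ready
second = dispatch(tasks)     # the higher-priority io_handler now runs
tasks[1].ready = False       # tracker blocks waiting for a system service
tasks[2].ready = False       # io_handler blocks again
third = dispatch(tasks)      # only logger is ready
```

Nothing in this rule refers to deadlines: the choice depends only on the fixed numerical priorities, which is the limitation the remainder of this section examines.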

Priorities are assigned to the tasks within the loads on the basis of system simulation results or in whatever
way the individual software developers see fit. No global policy is used to determine these priorities.
Apparently, they are based on the programmer's perception of the importance of the task relative to other tasks
in the same program. Some of the ideas such as giving higher priorities to I/O activities are borrowed from
techniques to maximize throughput in time sharing systems. The essential problem with ad hoc priority
assignment methods is the lack of understanding of the relationship between throughput, task importance,
and task deadlines.

After assigning priorities to tasks, each load is then individually tested. If one fails to meet any of its
deadlines, an attempt is made to modify it so no deadlines are missed. A modification may involve adding or
deleting one or more programs from the load, but usually less drastic measures are attempted first. Such
measures may involve changing task priorities, changing program logic (e.g., the order of certain events, or
synchronization), making performance improvements, etc. The programs in the load are thus modified until
they pass the tests.

This process, which we will refer to as "hand-crafting" processor loads, must be performed for each
processor load, and may have to be repeated when changes are made to the software, even for changes that seem
unrelated to scheduling.

If the tasks in a particular processor load are not co-schedulable (i.e., there are conditions under which some
task will be forced to miss its deadline), it is hoped that these conditions will become apparent during system
simulation or during testing, so they can be dealt with as described above. Failing that, it is hoped that if
these conditions do not occur during testing, they will also not occur while the system is in actual use.
Unfortunately, such conditions are most likely to occur in situations in which the system is under stress, such
as battle conditions.




This method of system integration has profound impacts on many aspects of the system. We explain these
impacts below.

1.1.1  System  Integration 

At  system  integration  time,  each  processor  load  is  hand-crafted  so  that  it  will  meet  its  real-time  constraints. 
This  can  be  a  very  time-consuming  process.  Timing-dependent  problems,  especially  transient  ones,  are  the 
hardest  bugs  to  find  in  a  computer  system.  In  the  absence  of  a  scientific  methodology  for  managing  task 
priorities,  the  current  practice  can  only  be  characterized  as  an  ad  hoc,  trial  and  error  approach,  depending  on 
the  skill,  intuition,  and  luck  of  the  system  integrator  for  any  chance  of  success. 

Transient timing problems are bound to occur in such a system, due to the low-level nature of the
dispatching primitives used to effect scheduling and the lack of an operating-system level scheduler. Any change in
the timing environment can perturb such a system, including but not limited to changes in network activity or
operating system executive activity. (These dependencies are at the root of a wide spectrum of timing
problems, and could be largely eliminated by implementing a processor scheduler. See Section 1.2.) Fixing
these timing problems, as well as problems with the functionality of the software (as opposed to its timing),
will require changes to the programs.

Of  course,  changes  to  any  program  could  potentially  affect  the  schedulability  of  every  processor  load  of 
which  that  program  is  a  part.  It  is  even  conceivable  that  a  change  in  one  program  could  affect  processor  loads 
of  which  it  is  not  a  part,  due  to  its  possible  effects  on  network  activity,  which  could  ripple  through  the  system, 
given the sensitivity of the current scheduling methods to such factors.

So, in the event of any change (and changes are relatively frequent at this stage of the system life-cycle) the
time-consuming process of testing and debugging the timing of every processor load of which that program is
a part is begun again. And every load that fails to pass the tests must go through the hand-crafting cycle again
to make sure it satisfies the scheduling requirements.

There  are  certain  trade-offs  which  can  affect  the  seriousness  of  the  problem  at  system  integration  time.  One 
involves utilization. The higher the degree of utilization at which processors are to be run, the more likely
that  a  given  processor  load  will  not  be  schedulable.  This  means  that  more  exercises  in  hand-crafting  will  be 
required,  and  that  these  exercises  will  be  more  difficult  and  more  time-consuming  to  successfully  complete. 
The  alternative  is  to  run  at  lower  utilization  levels,  sacrificing  a  large  degree  of  system  availability  and 
fault-tolerance. 

Another trade-off concerns flexibility in system reconfiguration, which also directly affects system
availability. The more varied processor loads that are available, the more flexible the reconfiguration process
will be, and the better the system will be able to recover from hardware failures. But if developing and testing
each processor load is an expensive process, then there will be pressure to produce fewer processor loads. This
will reduce reconfiguration flexibility.

The  current  handcraft  approach  also  significantly  impacts  the  testability  of  system  timing.  A  major  aspect 
of  testability  is  the  observability  of  system  state.  In  68K-OS,  the  information  needed  to  schedule  tasks  is 
spread out over all the tasks in a processor load, making it difficult to access during debugging. In a system
with a scheduler, the scheduling information would be localized within the scheduler, in its proper context,
and  easily  accessible.  Another  important  aspect  is  simply  the  complexity  of  the  system.  The  operation  of 
scheduling,  as  we  have  seen,  is  considerably  more  complex  than  it  would  be  under  an  operating  system  with  a 
scheduler.  This  complexity  makes  the  system  more  difficult  to  test  and  debug. 

Using  the  handcraft  approach,  the  amount  of  testing  that  must  be  done  to  establish  confidence  in  the  system 
is quite large. In addition to testing the functionality of the system, the schedulability of the system must be
continually tested as well. If there were a processor scheduler common to all the software in the system,
confidence in the ability of that scheduler to correctly schedule the system would be built up incrementally
with each test. Furthermore, each test of each program would present an opportunity to discover bugs in the
one  piece  of  software  concerned  with  scheduling,  namely  the  scheduler. 

As  it  is  now,  a  new  scheduler,  in  effect,  is  devised  for  every  processor  load.  Extra  testing  must  be  done  to 
test both this "scheduler" and the programs it schedules. This schedulability testing can be more
time-consuming than testing the functionality of the code, since timing problems are generally more difficult to
detect,  find,  and  fix  than  other  types  of  bugs. 

Of  course,  this  is  not  to  imply  that  a  better  scheduling  facility  would  eliminate  the  need  for  acceptance 
testing  of  the  system.  But  since  no  amount  of  testing  will  absolutely  guarantee  the  absence  of  bugs,  and  only  a 
finite  amount  of  testing  can  be  done,  then  it  is  better  to  allocate  the  available  testing  time  to  best  advantage. 
Having  a  scheduler  frees  much  of  the  time  the  old  method  would  have  used  for  testing  the  schedulability  of 
every  processor  load,  and  allows  it  to  be  used  to  test  other  aspects  of  the  system. 

1.1.2  Post-Deployment  Support 

The  concerns  about  the  impact  of  system  scheduling  in  the  system  integration  phase  become  even  more 
important  after  the  system  is  deployed.  Actual  use  of  the  system  will  inevitably  reveal  changes  that  need  to  be 
made,  either  to  fix  problems  with  the  software,  or  to  enhance  its  functionality  or  performance. 

Consider what happens when a single program in a processor load is changed. Any change is likely to alter
the timing behavior of the program. The new behavior will then be different from the behavior it displayed
during system testing. Thus each processor load of which it is a part must be tested again, not to test the
functionality of the change (one processor load test would suffice for that) but to test the schedulability of
each of the processor loads. It is possible that the change would render some subset of the loads
unschedulable, in which case the cycle of "hand-crafting" would be entered again, as described above.

Of  course,  the  same  applies  to  any  hardware  upgrade  that  alters  the  timing  behavior  of  the  hardware. 
Performance  upgrades  fall  into  this  category.  Thus,  if  more  capable  processors  were  to  be  used,  the  entire  set 
of  processor  loads  would  have  to  be  reformulated  in  order  to  take  advantage  of  the  added  performance.  If 
two versions of a processor, different only in speed, were to co-exist in the system, two sets of processor loads
would  have  to  be  developed  if  the  performance  advantage  of  the  faster  processors  were  to  be  exploited. 

As we mentioned above, this relatively expensive cycle of changes occurs fairly often during system
integration, and must be tolerated at that point in order to finally arrive at a working system. Once the system is
working (for some suitable definition of "working") and is deployed, the cost of any change rises
dramatically.

The point is that upgrades, of either software or hardware, can become very difficult when the current
scheduling methods are used. In fact, since timing is the most difficult part of a real-time system to get right,
there would be pressure to avoid any changes that could introduce timing problems, given that the means of
detecting and dealing with them are so unwieldy. There are already instances of this. A. K. Mok [Mok 83]
cites the following news item about the F18 aircraft, which appeared in ACM SIGSOFT Software Engineering
Notes [SEN 81]: "Apparently the effort that has gone into developing and testing the software is so extensive
that, when changes are required, it is now preferable to modify the plane to fit the existing software, rather
than to modify the software to match the plane." Glenford Myers [Myers 76] states that the costs of
maintenance and testing account for roughly 75% of the cost of software, and cites several examples, including the
SAGE system, which cost $20 million per year after 10 years of operation, compared to its initial development
cost of $250 million. A study by B. W. Boehm of TRW Corporation indicated that the life-cycle cost of
real-time software products has been three times as much as that for analytical software products.

1.2 The Solution: A Processor Scheduler

1.2.1  Proposed  Solution 

We  propose  that  implementing  a  processor  scheduler  as  part  of  the  operating  system  is  a  significant  step 
toward  addressing  the  problems  we  have  discussed.  The  processor  scheduler  should  implement  a  proven 
scheduling  policy.  It  should  require  that  the  applications  provide  accurate  information  on  which  to  base 
scheduling  decisions,  but  will  otherwise  be  independent  of  the  application  code.  The  rationale  for  such  a 
scheduler  and  its  potential  benefits  are  discussed  below. 




1.2.2  Rationale  and  Benefits 

It is evident that time is an important resource in a real-time system. The management of time is as
important as the management of databases or communications. Yet there is no on-line system-level support
for the management of processor time. Managing time involves, among other things, scheduling traffic on the
communications bus, scheduling processes to run on processors, and coordinating across processors. A
system-level strategy for managing time is needed, and a processor scheduler is a step toward that strategy.

We have seen that a processor scheduler is needed to implement scheduling policies. Theoretical work on
scheduling shows that certain policies are optimal in given circumstances, and indicates the possibility, with
further work, of developing processor scheduling algorithms that are efficient and, perhaps more importantly,
are guaranteed to correctly schedule a schedulable set of processes.

A  scheduler  based  on  a  scientific  scheduling  policy,  having  the  appropriate  mechanisms  to  implement  that 
policy,  could  potentially  eliminate  transient  timing  problems.  This  could  greatly  lower  system  testing  and 
integration costs, since such timing errors account for a large fraction of those costs. Independence from such
transient effects also enhances modularity, since such dependencies would no longer affect the applications
code. 

A processor scheduler also greatly enhances the modularity of the system, in the sense that separating the
concern of scheduling from the concern of correctly functioning applications code implies that changes to that
code be independent of scheduling issues. Either a set of processes is schedulable or not. If the scheduler
determines that they are not, then a different set of processes must be used. If they are schedulable, then
there is no question of "hand-crafting" them to meet deadlines.

This  greatly  improved  modularity  has  profound  effects  on  the  feasibility  of  post-deployment  support. 
Programs are no longer interdependent with regard to scheduling. The scheduler handles those problems,
and programs can be modified without particular concern for schedulability, beyond whether enough
processor resources will be available to support the enhanced program. This means that system maintenance and
upgrades  to  software  will  be  more  straightforward.  Further,  a  parametrized  implementation  of  the  processor 
scheduler  would  enable  performance  upgrades  to  the  hardware  to  be  incorporated  without  necessitating 
modifications  to  the  applications  software. 

System modularity also improves testability of system timing, since information related to scheduling is
localized in the scheduler and so is more easily accessible than if it is spread among all the applications tasks.
The entire scheduling system is less complex, since it is localized in a scheduler module in the operating
system, which has well-defined interfaces and interactions with applications tasks and other parts of the
system. Structuring a scheduler as a single module means that it is possible to test that module as a unit.
Furthermore, every test of applications software incidentally tests the scheduler as well, in contrast to the
current method, in which the scheduling of each processor load is different and must be tested separately.
Fixes to problems found within the scheduler benefit all processor loads, whereas scheduling problems found
in the current method only apply to the particular processor load in which they are encountered.

Aside from these considerable benefits, a processor scheduler paves the way for process level
reconfiguration, a method of responding to hardware failures that has enormous advantages over the current method of
processor level reconfiguration. The latter method is dictated, in part, by the lack of a processor scheduler.

Another advantage, in terms of reusable software, is that a scheduler developed for a system would be
useful  in  other  real  time  systems. 

These  goals  can  only  be  achieved  if  a  longstanding  tradition  of  real-time  systems  is  broken.  Heretofore,  the 
major  responsibility  of  the  developers  of  real-time  operating  systems  was  seen  to  be  providing  a  set  of 
primitive operations for scheduling. These primitives were to be used by the application programmers to
build  their  own  scheduling  structures  as  they  saw  fit.  The  operating  system  assumed  no  responsibility  for 
schedulability,  which  was  just  as  well,  since  most  (if  not  all)  such  systems  could  provide  no  guarantees. 

However,  in  a  highly  complex  system  this  is  insufficient.  There  is  a  need  for  a  scheduling  service  to  be 
provided  to  applications,  which  obey  a  discipline  of  using  the  scheduler’s  operations,  in  exchange  for  some 
guarantee of schedulability. The operating system must assume responsibility for schedulability, because it is
the  logical  place  in  the  system  for  the  implementation  of  a  scientific  scheduling  policy.  The  operating  system 
should  provide  an  environment  for  applications  software  in  which  the  scheduling  issues  are  handled  for  it, 
just  as  current  operating  systems  provide  an  environment  for  processes  in  which  control  of  I/O  devices  is 
handled  by  the  operating  system  on  their  behalf. 

The  current  situation  with  respect  to  scheduling  is  analogous  to  the  situation  that  prevailed  in  programming 
languages  before  the  advent  of  structured  programming.  The  programmer  was  provided  with  a  powerful 
primitive for control structuring, the GOTO statement, and was expected to use it to build whatever control
structures he saw fit. Since structured programming has been accepted by most of the computing community,
and languages have emerged which support this style, the days of the GOTO-filled program are not missed by
the majority of programmers. They are willing to sacrifice some degree of power for the benefits of
understandability and maintainability that a structured approach to the flow of control in programs confers.

In the case of scheduling, real-time programmers should be ready to give up the power of direct control
over the operating system's scheduling actions in favor of the benefits of modularity and guaranteed
scheduling that an operating system-level processor scheduler confers. And, as programming language designers
were charged with the task of providing useful and usable control structuring primitives, the real-time
operating system designers are now charged with providing useful and usable scheduling facilities.




2.  Scheduling  Theory 

2.1  Basic  Concepts 

2.1.1 Policy and Mechanism

It  is  useful  to  view  a  process  scheduling  facility  as  consisting  of  two  conceptual  parts:  policy  and  mechanism. 
The  policy  determines  how  scheduling  decisions  are  made,  and  the  mechanism  is  the  agency  by  which  they 
are carried out. Policy and mechanism are directly analogous to the concepts of strategy and tactics. The
policy/mechanism  dichotomy  has  useful  application  to  other  forms  of  resource  management  as  well.  The 
Archons  project  of  Carnegie-Mellon  University  contains  many  useful  examples. 

To illustrate this view, we will briefly describe the general scheduling model developed by Ruschitzka and
Fabry [Rusch 77]. In this model, a scheduling policy consists of three parts:

•  a  decision  mode 

•  a  priority  function 

•  an  arbitration  rule 

The  decision  mode  determines  when  a  rescheduling  occurs.  Common  examples  of  the  decision  mode  are 
non-preemptive, pre-emptive, or quantum-oriented pre-emptive (in which a task is allowed to execute
continuously for at most a fixed period of time, the quantum, after which a rescheduling may occur).

The  priority  function  is  an  arbitrary  function  of  task  and  system  parameters  that  assigns  a  priority  to  a  task. 
The  priority  could  be  a  function  of  task  run  time,  task  period,  system  load,  etc.  A  priority  function  can  be 
defined to support any scheduling policy including commonly used policies such as FIFO, round-robin,
deadline,  or  rate-monotonic  scheduling. 

The arbitration rule resolves any conflict which may occur among jobs with equal highest priority.
Examples of possible arbitration rules include FIFO and round-robin.

In this model, mechanisms would be required to support the scheduling policy. These include mechanisms
to cause the scheduler to run at a time determined by the decision mode, to evaluate the priority function, and
to dispatch the task indicated by the priority function or the arbitration rule.
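
As an illustration of how the three parts fit together, the sketch below separates a policy record from a single selection mechanism. It is not taken from [Rusch 77]: the names Job, Policy and select are hypothetical, the decision mode is reduced to a preemptive/non-preemptive flag, and a numerically larger priority value is assumed to be more urgent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    name: str
    period: float
    deadline: float   # absolute deadline of the current request
    arrival: float    # time at which the job became ready


@dataclass
class Policy:
    preemptive: bool                          # decision mode (simplified)
    priority: Callable[[Job, float], float]   # priority function
    arbitrate: Callable[[List[Job]], Job]     # arbitration rule for ties


def select(policy: Policy, ready: List[Job],
           running: Optional[Job], now: float) -> Optional[Job]:
    """Mechanism: choose the job to run next under the given policy."""
    if running is not None and not policy.preemptive:
        return running                        # no rescheduling point yet
    candidates = list(ready)
    if running is not None:
        candidates.append(running)
    if not candidates:
        return None
    best = max(policy.priority(j, now) for j in candidates)
    tied = [j for j in candidates if policy.priority(j, now) == best]
    return tied[0] if len(tied) == 1 else policy.arbitrate(tied)


def fifo(tied: List[Job]) -> Job:
    """Arbitration rule: the earliest arrival wins."""
    return min(tied, key=lambda j: j.arrival)


# Two policies that differ only in their priority function.
rate_monotonic = Policy(True, lambda j, now: -j.period, fifo)
deadline_driven = Policy(True, lambda j, now: -j.deadline, fifo)
```

Changing the policy here means supplying a different Policy value; the select mechanism itself is untouched, which is the point of the separation.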


In following sections, we will see that it is important that a scheduling facility provide good mechanisms for
specifying and implementing scheduling policies. It is in this area that the current SubACS mechanisms fall
short.




2.1.2  Policies  and  Performance 

To understand the issues involved in scheduling real-time systems with hard deadlines, it is necessary to
review  some  results  from  scheduling  theory.  These  results  are  relatively  straightforward  analytically,  but  are 
not  intuitively  obvious  and  consequently  may  be  surprising.  The  discussion  below  is  based  on  the  work  of 
Liu  and  Layland  [Liu  73]. 

2.1.2.1 Assumptions

A  number  of  simplifying  assumptions  are  made  in  this  work: 

•  We  are  scheduling  periodic  tasks  with  hard  deadlines. 

•  A  task  has: 

o a relative initiation time, I,
o a period, T, and
o an amount of processing, C, that must be completed within the period T.

•  The  tasks  are  independent,  i.e.  there  are  no  precedence  relations  or  synchronization  among  tasks. 

•  Tasks  can  be  instantly  pre-empted. 

•  No  overhead  is  incurred  in  task  swapping,  interrupt  handling,  scheduling,  etc. 

•  Priority  granularity  (number  of  distinct  priority  levels)  is  adequate. 

These  assumptions  are  admittedly  unrealistic;  nevertheless,  they  lead  to  interesting  and  useful  results. 
Much  of  our  work  is  aimed  at  relaxing  these  assumptions  without  losing  the  ability  to  make  guarantees  about 
the schedulability of tasks.

Under  these  assumptions  we  will  consider  how  tasks  can  be  scheduled  to  ensure  that  all  processing  is  done 
and all deadlines are met. Within this framework, we will consider three issues:

• Can the initiation times (I) be controlled, or are they arbitrary?

• Can the periods (T) be controlled, or are they arbitrary?

•  Is  a  fixed  or  dynamic  priority  scheduling  algorithm  to  be  used? 

In a later section we will consider the effects of conditions which violate the assumptions, e.g., non-zero
scheduling  overhead  and  interrupt  processing. 




2.1.2.2 Examples

From the assumptions, if a task has a relative initiation time I and period T, then it is initiated at times
I, I+T, I+2T, I+3T, ... . The processing requirement, C, for the task initiated at time I+kT must be
completed by time I+(k+1)T.
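
The initiation times and deadlines can be enumerated directly from I and T. The helper below is a hypothetical illustration (the name job_windows is ours), using integer time units so that the arithmetic is exact.

```python
def job_windows(initiation, period, count):
    """Return (initiation time, deadline) pairs for the first `count`
    requests of a task: request k starts at I + k*T and is due at I + (k+1)*T."""
    return [(initiation + k * period, initiation + (k + 1) * period)
            for k in range(count)]


# A task with I = 1 and T = 3 (arbitrary illustrative values).
windows = job_windows(1, 3, 3)
```

Each deadline coincides with the next initiation time, which is why a request that is still unfinished when the next one arrives has missed its deadline.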

Another quantity of interest is the utilization U of the task. The utilization is a measure of how much of the
processor is being used by the task. A task has utilization U = C/T. A set of n tasks has total utilization

U = C_1/T_1 + C_2/T_2 + ... + C_n/T_n.
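
As a quick check of the definition, the following fragment computes the total utilization of a task set given as (C, T) pairs. The function name is hypothetical; the values are those of the two-task example used in the figures of this section (C_1 = 1.0, T_1 = 2, C_2 = 1.2, T_2 = 3).

```python
def total_utilization(tasks):
    """Sum C/T over a list of (C, T) pairs."""
    return sum(c / t for c, t in tasks)


example = [(1.0, 2.0), (1.2, 3.0)]
u_total = total_utilization(example)   # 0.5 + 0.4
```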


Consider  the  example  in  Figure  2-1.  Task  1  has  higher  priority  than  task  2.  Note  that  if  task  2  is  given 
higher  priority,  then  task  1  will  miss  its  deadline  at  time  2.  This  situation  is  illustrated  in  Figure  2-2. 


[Figure: timeline of task 1 and task 2 over the interval 0 to 6]

I_1 = 0      I_2 = 0.2
C_1 = 1.0    C_2 = 1.2
T_1 = 2      T_2 = 3
U_1 = 0.5    U_2 = 0.4    U_total = 0.9

Figure 2-1: Two Periodic Tasks Scheduled Correctly


Now consider the modified example in Figure 2-3. Note that task 1 has higher priority than task 2, and task
2 misses its deadline at time 3. However, if task 2 is given higher priority, then task 1 will miss its deadline
at time 2. This situation is illustrated in Figure 2-4. So we can see that this case cannot be scheduled using a
fixed priority algorithm, i.e., an algorithm that assigns an unchanging priority to each of the tasks.


A dynamic priority algorithm, however, will work. In Figure 2-5 the two tasks are scheduled such that their
deadlines are met by varying their relative priority. At times, one task has higher priority, while at other times
the other task does. In this case, the algorithm used for assigning the priorities is known as the deadline
scheduling algorithm, in which the task with the nearest deadline has highest priority.
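
The behavior described for the modified example can be reproduced with a small simulation. The sketch below is our own illustration, not the authors' tooling: time is scaled by ten so that all quantities are integer ticks (C_1 = 10, T_1 = 20, C_2 = 12, T_2 = 30, both tasks initiated at time zero), the simulation runs for one hyperperiod of 60 ticks, and a request that misses its deadline is simply abandoned when the next request arrives.

```python
def simulate(tasks, horizon, pick):
    """Simulate pre-emptive scheduling in unit ticks.

    tasks: list of (initiation, cost, period) in integer ticks.
    pick:  function choosing one (task_index, deadline) pair from the ready list.
    Returns a list of (task_number, time) pairs for missed deadlines.
    """
    remaining = [0] * len(tasks)
    deadline = [0] * len(tasks)
    missed = []
    for now in range(horizon):
        for i, (initiation, cost, period) in enumerate(tasks):
            if now >= initiation and (now - initiation) % period == 0:
                if remaining[i] > 0:              # previous request unfinished
                    missed.append((i + 1, now))
                remaining[i] = cost
                deadline[i] = now + period
        ready = [(i, deadline[i]) for i in range(len(tasks)) if remaining[i] > 0]
        if ready:
            chosen, _ = pick(ready)
            remaining[chosen] -= 1
    for i in range(len(tasks)):                   # deadlines falling at the horizon
        if remaining[i] > 0 and deadline[i] <= horizon:
            missed.append((i + 1, deadline[i]))
    return missed


def task1_first(ready):        # fixed priority, task 1 highest
    return min(ready, key=lambda r: r[0])


def task2_first(ready):        # fixed priority, task 2 highest
    return max(ready, key=lambda r: r[0])


def nearest_deadline(ready):   # deadline scheduling; ties go to the lower index
    return min(ready, key=lambda r: (r[1], r[0]))


modified = [(0, 10, 20), (0, 12, 30)]
misses_fixed_1 = simulate(modified, 60, task1_first)
misses_fixed_2 = simulate(modified, 60, task2_first)
misses_deadline = simulate(modified, 60, nearest_deadline)
```

Under these assumptions the first fixed assignment reports a miss for task 2 at tick 30 (time 3), the second reports a miss for task 1 at tick 20 (time 2), and the nearest-deadline rule reports none, in agreement with the text.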

Note that in this case, a fixed priority algorithm will work if I_1 = 0 and 0.1 ≤ I_2 ≤ 0.4 or 0.6 ≤ I_2 ≤ 0.9. In
other words, 60% of the possible phasings of the two tasks are schedulable, 40% are not. However, deadline
scheduling will always work if the total utilization U ≤ 1.0.

[Figure: timeline of task 1 and task 2 over the interval 0 to 6]

I_1 = 0      I_2 = 0.2
C_1 = 1.0    C_2 = 1.2
T_1 = 2      T_2 = 3
U_1 = 0.5    U_2 = 0.4    U_total = 0.9

Figure 2-2: Tasks of Figure 2-1 Scheduled Incorrectly

[Figure: timeline of task 1 and task 2 over the interval 0 to 5]

I_1 = 0      I_2 = 0
C_1 = 1.0    C_2 = 1.2
T_1 = 2      T_2 = 3
U_1 = 0.5    U_2 = 0.4    U_total = 0.9

Figure 2-3: Fixed Priority Unschedulable 1

2.1.2.3 Results

The worst case for scheduling periodic tasks occurs when all initiation times are the same (equal to zero).
Liu and Layland call this the critical instant. Since this is the worst possible case, only the first deadline of
each task must be checked to determine whether the tasks are schedulable. If the first deadline after the
critical instant is met, then all other deadlines will be met as well. Liu and Layland analyzed scheduling
algorithms for this worst-case situation to determine the optimal algorithms for fixed priority scheduling and
dynamic priority scheduling.
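
One way to carry out the first-deadline check for a fixed priority assignment is sketched below. The fixed-point iteration used to find when each task's first request completes is our choice of method and is not described in the text; the function name and the use of integer ticks (time scaled by ten) are likewise assumptions.

```python
def first_deadlines_met(tasks):
    """Check the first deadline of every task after a critical instant.

    tasks: list of (cost, period) in integer ticks, ordered from highest to
    lowest fixed priority, all initiated together at time zero.
    """
    for i, (cost, period) in enumerate(tasks):
        finish = cost
        while True:
            # Work demanded up to `finish` by this task's first request and by
            # every request of the higher-priority tasks (integer ceiling).
            demand = cost + sum(-(-finish // t) * c for c, t in tasks[:i])
            if demand > period:
                return False          # the first deadline is missed
            if demand == finish:
                break                 # the first request completes at `finish`
            finish = demand
    return True


# Modified example of this section: C_1 = 1.0, T_1 = 2, C_2 = 1.2, T_2 = 3.
modified_ok = first_deadlines_met([(10, 20), (12, 30)])
# A lighter variant with C_2 = 0.8, chosen here only for contrast.
lighter_ok = first_deadlines_met([(10, 20), (8, 30)])
```

For the modified example the check fails on task 2, whose first request cannot finish by time 3 once the first two requests of task 1 are accounted for.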




Figure  2-5:  Deadline  Scheduling  of  Tasks 


For fixed priority scheduling, an optimal algorithm is the rate monotonic algorithm. The rule for assigning
priorities to tasks in this algorithm is that task i has priority over task j if and only if T_i < T_j. If T_i = T_j, the
relative priorities are arbitrary. (Conceptually, tasks i and j can be combined into a single task, which has the
same effect on schedulability as the two separate tasks of equal period.) Priorities are assigned based only on
the periods. The C_i are ignored.


The rate monotonic algorithm can schedule any phasing of any task set having utilization U ≤ log_e 2 ≈ 0.693
(or n(2^(1/n) - 1), for a set of n tasks). This is the worst case.
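
The bound is easy to tabulate. The short fragment below (the function name rm_bound is ours) evaluates n(2^(1/n) - 1) and compares it with the limiting value log_e 2.

```python
import math


def rm_bound(n):
    """Utilization bound n(2^(1/n) - 1) for a set of n tasks."""
    return n * (2 ** (1 / n) - 1)


bound_two_tasks = rm_bound(2)     # roughly 0.828
bound_many_tasks = rm_bound(1000) # close to the limit below
limit = math.log(2)               # roughly 0.693
```

The two-task examples in this section have total utilization 0.9, which lies above the two-task bound, so the bound makes no promise for them; this is consistent with some phasings being schedulable under fixed priorities and others not.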

The  deadline  scheduling  algorithm  is  a  dynamic  priority  algorithm  that  is  optimal  in  general.  It  can  schedule 
any task set having total utilization U ≤ 1.0.

A  dynamic  priority  algorithm  can  be  more  difficult  to  implement  than  one  that  uses  fixed  priorities,  but  the 
results  of  Liu  and  Layland’s  analysis  indicate  that  even  the  optimal  fixed  priority  scheduling  algorithm 
sacrifices  over  30%  of  CPU  utilization  if  schedulability  is  to  be  guaranteed.  In  contrast,  the  optimal  (dynamic 
priority)  scheduling  algorithm  sacrifices  no  CPU  utilization.  Therefore,  it  would  be  advantageous  to  use  the 
deadline  scheduling  algorithm,  since  it  is  unlikely  that  the  additional  overhead  it  would  incur  could  even 
approach the 30% already wasted by the rate monotonic algorithm.
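The figures quoted above can be reproduced numerically. The following is a minimal sketch, not part of the original analysis; the function name `rm_bound` is our own. It evaluates the rate monotonic worst-case bound n(2^(1/n) - 1) for several task counts and compares it with its limit ln 2 and with the deadline scheduling bound of 1.0.

```python
import math

EDF_BOUND = 1.0  # deadline scheduling can use the whole processor


def rm_bound(n: int) -> float:
    """Worst-case utilization bound n(2^(1/n) - 1) for n tasks under rate monotonic."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * (2.0 ** (1.0 / n) - 1.0)


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 100):
        bound = rm_bound(n)
        # "sacrificed" is the utilization given up relative to deadline scheduling
        print(f"n={n:3d}  bound={bound:.4f}  sacrificed={EDF_BOUND - bound:.1%}")
    print(f"limit as n grows: ln 2 = {math.log(2):.4f}")
```

The bound decreases monotonically with n toward ln 2, which is the source of the "over 30%" figure.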

2.2  Priority  Inversion:  A  Serious  Problem 

2.2.1  Examples 

Having  introduced  the  basic  concepts  and  results  of  scheduling  theory,  we  now  examine  the  effect  of 
deviating  from  an  optimal  scheduling  algorithm.  Priority  inversion  occurs  when  a  task  runs  at  a  priority 
different  from  that  dictated  by  an  optimal  scheduling  algorithm.  The  schedulability  results  presented  earlier 
are  no  longer  applicable.  It  is  possible  to  find  deadlines  being  missed  with  the  task  set  having  relatively  small 
total  utilization  if  priority  inversions  are  allowed  to  occur. 

Figure 2-2 is an example of priority inversion. The tasks are given priorities that are exactly the inverse of
those that would be assigned by the rate monotonic algorithm, and so they are no longer schedulable. This is
a particularly blatant example of priority inversion, but in fact even a "small" violation of the priority
assignments can cause severe problems.

Consider  the  following  example: 

Task 1   I1 = 0   C1 = 0.01   T1 = 1     U1 = 0.01

Task 2   I2 = 0   C2 = 1.00   T2 = 100   U2 = 0.01

Utotal = 0.02

In this case, if we schedule the tasks according to the rate monotonic algorithm, then task 1 has higher
priority than task 2, and the tasks are schedulable. If, however, the priorities become inverted (task 2 has
higher priority) then this task set cannot be scheduled. Task 1 will miss its deadline at time 1.


¹ There is an interesting best case for the rate monotonic algorithm, in which the set of tasks is totally harmonic, i.e. the ratio of any
period to any shorter (or equal) period is an integer. The rate monotonic algorithm can schedule any task set having this property, if the
total utilization $U \le 1.0$.


This example is admittedly contrived, but it serves to illustrate graphically the problems that priority
inversion can cause with scheduling even the most (seemingly) innocuous task set. Priority inversion is a most
serious problem since, if it occurs, there is no lower bound below which correct scheduling can be guaranteed.




3.  Real  Time  Scheduling  Issues 

In  the  previous  section,  we  have  discussed  the  scheduling  theory  under  idealized  circumstances,  introduced 
the  concept  of  priority  inversion  and  illustrated  its  potential  harmful  effects.  In  this  section,  we  discuss  a 
number  of  practical  issues  which  can  cause  priority  inversion  if  they  are  not  handled  properly.  In  addition, 
we give an overview of the approach used by the Archons project at Carnegie-Mellon University for solving
real  time  scheduling  problems. 

3.1  Sources  of  Priority  Inversion 

In  the  scheduling  of  real  time  tasks,  there  are  three  important  sources  of  possible  priority  inversion: 

•  I/O  scheduling:  the  arbitration  of  DMA  and  interrupts. 

•  Task  interactions:  the  handling  of  task  precedence  relations,  of  inter-process  communication  and 
of  the  synchronization  of  accesses  to  shared  data. 

•  Transient  overloads  to  processors. 

We  discuss  each  of  the  topics  in  turn. 

3.1.1  I/O  Scheduling 

The  first  important  source  of  possible  priority  inversion  in  many  real  time  executives  is  the  inconsistency 
between  the  algorithms  used  to  schedule  the  I/O  activities  and  to  schedule  tasks  in  the  processor.  To 
illustrate  this  point,  let  us  first  examine  the  structure  of  periodic  tasks. 

A typical periodic task consists of three phases. The first phase is the transfer of the data from an input
device into the main memory. The second phase is the execution of the computation. The third phase is the
transfer of the results of the computation to an output device. All three phases must be completed during the
task period in order for the task to meet its deadline. Since a single cycle of a real time task includes both I/O
and computation, it is easy to see that the scheduling of the I/O activities must be consistent with the
scheduling of the computation activities of the processor. Nevertheless, the general practice found in many
real time systems is to arbitrate DMAs using either a FIFO or a round-robin scheduling policy. At the end of
a DMA, the I/O device interrupts the processor with an arbitrarily defined priority level that is unrelated to
the algorithm used to schedule the processor. Such a practice is a source of priority inversion, because the
I/O activity associated with a low priority task can slow down either the computation or the I/O activity of a
high priority task. To illustrate the potential problem caused by this practice, let us consider the following
example.

Suppose we have 5 tasks, T1, ..., T5, with periods 10, 20, 30, 40, and 50. Each task has 2 time units for
DMA-in, 2 time units for computation and 2 time units for DMA-out. For simplicity, we further assume the
"cycle stealing" or processor slowing down effect of DMA and the overhead of handling DMA completion
interrupts are negligible. Suppose that at time zero, all the 5 tasks are ready to have their data input by
DMAs. Suppose the data input to task T1 has arrived somewhat later than the others. If we follow the FIFO rule
for executing DMAs, then task T1 cannot start until t = 10 and consequently misses its deadline. On the
other hand, if we use the rate monotonic scheduling algorithm to schedule the computations and to schedule
DMAs, then the data are input in the order of task T1 to task T5 and all the deadlines are met. Thus, in a real
time operating system, it is important to arbitrate I/O activities in an order that is consistent with the
processor scheduling algorithm.
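The effect of the DMA ordering on task T1 can be sketched as follows. This is an illustrative model of our own, not the analysis used in the report: we assume a single DMA channel that serves one input transfer at a time, that all five requests are queued at time zero, and we look only at when each task's input transfer completes. The names `input_ready_times`, `fifo_order` and `rm_order` are hypothetical.

```python
DMA_IN = 2    # time units of input DMA per task
COMPUTE = 2   # time units of computation per task
DMA_OUT = 2   # time units of output DMA per task

periods = {"T1": 10, "T2": 20, "T3": 30, "T4": 40, "T5": 50}


def input_ready_times(order, dma_in=DMA_IN):
    """Return {task: time its input DMA completes} for a single DMA channel
    that serves the tasks one after another in the given order."""
    ready, clock = {}, 0
    for task in order:
        clock += dma_in
        ready[task] = clock
    return ready


fifo_order = ["T2", "T3", "T4", "T5", "T1"]   # T1's request arrived last
rm_order = sorted(periods, key=periods.get)   # shortest period first

fifo = input_ready_times(fifo_order)
rm = input_ready_times(rm_order)

for label, ready in (("FIFO", fifo), ("rate monotonic", rm)):
    # Time left for T1 after its input, computation and output, ignoring
    # any further contention; a negative value means a certain miss.
    slack = periods["T1"] - ready["T1"] - COMPUTE - DMA_OUT
    print(f"{label:15s} T1 input ready at t={ready['T1']:2d}, best-case slack={slack}")
```

Under FIFO the input for T1 completes at t = 10, which is already its deadline. The slack computed here is only a necessary condition; the sketch does not verify the claim that all deadlines are met under the rate monotonic ordering.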

3.1.2  Task  Interactions 

In  many  real  time  operating  systems,  the  second  important  source  of  potential  priority  inversions  is  the 
method  of  handling  task  interactions. 

Tasks  can  have  precedence  relations  and  can  share  data.  In  addition,  tasks  can  interact  with  each  other 
through  system  executive  calls  or  inter-process  communication  (IPC)  mechanisms.  In  most  existing  real  time 
systems,  the  mechanisms  to  handle  task  interactions  are  those  developed  in  the  context  of  time  sharing 
systems,  such  as  semaphores  for  synchronization  and  mailboxes  (or  ports)  for  IPC.  These  mechanisms  were 
developed  without  any  consideration  of  satisfying  real  time  constraints.  For  example,  an  undisciplined  use  of 
semaphores  could  allow  a  low  priority  task  to  block  the  execution  of  a  high  priority  task.  Indeed,  in  many 
existing  systems  the  operating  system  executive  executes  at  the  highest  priority  and  the  execution  on  behalf  of 
a  system  call  executes  at  this  highest  priority.  Consequently,  a  prolonged  system  call  can  lock  out  other  tasks 
that  might  have  become  the  highest  priority  task  during  that  time  period.  In  a  real  time  operating  system, 
mechanisms for synchronization, IPC, and system executive calls must be structured in a way that
complements rather than interferes with the scheduling activities.

3.1.3  Transient  Processor  Overloads 

The  third  source  of  potential  priority  inversion  results  from  attempts  to  protect  important  tasks  from 
missing  deadlines  under  periods  of  transient  processor  overload.  Transient  processor  overload  can  be  caused 
by  stochastic  processing  time  or  exception  handling.  In  some  applications,  the  computation  time  of  a  task  is 
highly data dependent. In some other cases, the exception handling codes are large but invoked so
infrequently that it is not wise to reserve CPU cycles for them during periods of normal execution. In either of the
two situations, in order to provide an adequate average processor utilization, one must be prepared to tolerate
some  occasional  transient  processor  overload. 

There is a tendency for programmers to assign a higher priority to a task that is more important to a
mission than other tasks. This tendency is especially pronounced when there is the possibility of transient
processor overloads. However, in real time systems it is crucial to meet all the deadlines whenever possible.
Assigning higher priority to a more important task without analyzing the impact on deadlines could force less
important tasks to miss their deadlines unnecessarily. For example, suppose that task 1 has a period of 5 time
units and an execution time of 2 units, while task 2 has a period of 10 units and execution time 4 units. Task 2
is also known to be more important than task 1. According to the rate monotonic algorithm, task 1 should be
given higher priority, and both tasks can be successfully scheduled using this algorithm. If task 2 were
assigned higher priority, then task 1 would be forced to miss its deadline unnecessarily: starting from a
critical instant, task 2 would occupy the processor from time 0 to 4, so task 1 could not complete before
time 6, after its deadline at time 5.

On the other hand, the importance of the task becomes crucial when the processor is sufficiently
overloaded, so that all the deadlines cannot be met. In an overloaded situation, we want to sacrifice the less
important tasks. It is important to observe that the simple rate monotonic or deadline scheduling algorithms
do not behave well under overloaded conditions. In the case of the rate monotonic algorithm, the tasks with
longest period will be the ones to miss their deadlines. In the case of dynamic deadline scheduling
algorithms, any or all of the tasks could miss their deadlines during an overload period.

In  summary,  the  problem  of  transient  processor  overloads  is  a  serious  one.  The  practice  of  assigning  higher 
priority  to  a  more  important  task  tends  to  force  less  important  tasks  to  miss  their  deadlines  unnecessarily.  On 
the  other  hand,  simple  scheduling  algorithms  such  as  rate  monotonic  or  deadline  scheduling  algorithms 
cannot  guarantee  that  the  deadlines  of  important  tasks  will  be  met  during  an  overload  situation.  Nonetheless, 
schedulers  developed  for  real  time  operating  systems  should  ensure  that  the  deadlines  of  all  tasks  will  be  met 
when  the  system  is  not  overloaded.  Furthermore,  only  those  least  important  tasks’  deadlines  should  be 
missed  when  the  system  is  experiencing  an  overload. 


References 


[Liu 73]  Liu, C. L. and Layland, J. W.
Scheduling Algorithms for Multiprogramming in a Hard Real Time Environment.
JACM 20(1), 1973.

[Mok 83]  Aloysius Ka-Lau Mok.
Fundamental Design Problems of Distributed Systems for the Hard-Real-Time Environment.
PhD thesis, Massachusetts Institute of Technology, 1983.

[Myers 76]  Myers, G.
Software Reliability.
John Wiley & Sons, 1976.

[Rusch 77]  Ruschitzka and Fabry.
A Unifying Approach to Scheduling.
CACM 20(7), 1977.

[SEN 81]  Staff.
News Item.
ACM SIGSOFT Software Engineering Notes 6(2), 1981.




4.6  Analysis  of  Processor  Scheduling  with  Overhead 




Analysis  of  Processor  Scheduling  with  Overhead 

1  Introduction 

An  analysis  of  the  schedulability  of  periodic  tasks  with  hard  deadlines  was  presented  by  Liu  and 
Layland  [Liu  73].  This  analysis  indicates  that  any  such  task  set  can  be  scheduled  if  it  utilizes  no  more 
than  the  available  processor  cycles.  Unfortunately,  this  analysis  ignored  several  factors  that  are 
important in many systems. These include I/O and timer interrupts which can cause priority inversions,
the overhead of the scheduling activity itself and overhead from task swapping. Some of these
can  be  regarded  as  a  part  of  the  periodic  tasks  and  so  can  be  incorporated  into  the  analysis.  Other 
overhead  aspects  cannot  be  so  regarded.  In  general,  we  must  develop  an  analysis  which  includes  a 
variety  of  overheads. 

2  A  Quantum  Formulation 

There are several ways to deal with the interrupt problem. First, one could have a co-processor
which  would  handle  the  scheduling  and  interrupts.  This  would  leave  only  task  swapping,  and  it  could 
be  incorporated  directly  into  the  scheduling  analysis.  A  second  approach  is  to  create  a  time  quantum 
during  which  a  task  can  be  executed  without  interruption.  In  this  approach,  some  hardware  queueing 
device  would  be  needed  to  queue  the  interrupts  arriving  in  the  middle  of  a  quantum.  We  will  analyze 
this  latter  approach  in  this  section  and  develop  queueing/co-processor  solutions  in  a  subsequent 
report. We must, however, determine the power of dynamic deadline scheduling or rate monotonic scheduling
when overhead is present and a quantum approach is used. There is an important trade-off issue
concerning the quantum length. On the one hand, a long quantum allows a high priority task to be
executed without interruption and the attendant priority inversions. On the other hand, a long quantum
allows a low priority task to execute for a long period without interruption as well. This is likely to
cause  a  priority  inversion  as  higher  priority  tasks  are  forced  to  wait.  In  such  a  fixed  quantum  size 
situation,  one  can  determine  the  optimal  size  to  allow  maximum  utilizations.  We  will  see,  however,  a 
variable  quantum  length  is  feasible  and  has  a  greater  potential. 

3  A  Fixed  Quantum  Formulation 

The processor scheduling problem has the following formulation. There are n periodic tasks. The $i$th
task has period $T_i$ and requires $C_i$ units of processing time. Let $U_i = C_i/T_i$ be the utilization of the task
and $V_i = \sum_{j=1}^{i} U_j$ be the cumulative utilization of the first i tasks. We assume that the tasks are
ordered so that $T_1 \le \dots \le T_n$. We assume that a process will execute for a period of time q or until it
completes, whichever comes first. At the end of this period, the executive takes over and attends to
interrupts, swaps in a ready process of higher priority if one is available, or continues the current
process if it is of highest priority and not yet complete. The time for any of these activities is random
depending upon the activity required. Indeed, the overhead may well depend on q with larger values
of q creating larger overhead requirements. In this analysis, we assume that the overhead is
independent of q and is given by a constant H. This analysis can be extended to allow H to depend on
both q and n, the number of tasks being handled. This would require an exact formula describing the
functional relationship. Future analysis will deal with such generalization.

A task is divided into quanta of processing. Each quantum generates an overhead period of H. This
will result in an increase in processing time of CH/q and an increase in utilization of UH/q. The
inflated total processor utilization is given by U(1 + H/q). We assume that the executive time required
to handle I/O interrupts associated with this task is included in the processor requirement C. In fact,
the inflation factor (1 + H/q) is only approximate. C must be replaced by $C + \lceil C/q \rceil H$, where $\lceil \cdot \rceil$ is the
ceiling function. Consequently, we note

$C(1 + H/q) \le C + \lceil C/q \rceil H < C(1 + H/q) + H.$

This  suggests  using  large  q  in  order  to  minimize  the  inflation  of  the  processing  time  of  the  task. 
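The inequality above can be checked numerically. The following is a minimal sketch with function names of our own choosing; the values C = 10 and H = 0.5 are arbitrary illustrative inputs, not figures from the analysis.

```python
import math


def inflated_cost(C: float, q: float, H: float) -> float:
    """Exact processing requirement: C plus one overhead H per quantum used."""
    return C + math.ceil(C / q) * H


def approx_inflated_cost(C: float, q: float, H: float) -> float:
    """Approximation C(1 + H/q) that ignores the partially used last quantum."""
    return C * (1 + H / q)


if __name__ == "__main__":
    C, H = 10.0, 0.5
    for q in (1, 2, 3, 4, 7, 10):
        lower = approx_inflated_cost(C, q, H)
        exact = inflated_cost(C, q, H)
        print(f"q={q:2d}  C(1+H/q)={lower:7.3f}  exact={exact:7.3f}  upper={lower + H:7.3f}")
```

The exact and approximate values coincide whenever q divides C evenly, and both shrink toward C as q grows.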

Using the quantum formulation, it is possible to analyze the worst case schedulability of the tasks. It
is possible that the initiation time (and hence deadline) of a task can be controlled. It is, however,
more likely that this cannot be controlled. Consequently, the initiation time of a high priority task can
occur just at the beginning of the quantum of processing assigned to a lower priority task. The delay
before the task begins processing can be as much as H + q. This would argue for a small value of q in
order to minimize the starting delay to any process. Therefore, there is a trade-off to be made in the
choice of q. This involves balancing the additional processing time arising from short quanta against
the priority inversion that is likely to occur with long quanta. We seek to find an optimum quantum
size. In a worst case formulation, we wish to find a task set having minimum utilization whose inflated
processing requirement fully utilizes the processor; that is, if the processing requirement of
any of the tasks is increased, then some deadline is missed. We then wish to choose q to maximize this
minimum utilization.

Consider first the case of a single task, n = 1. The task will be initiated at time 0, have deadline T
and processing requirement C. In the worst case, an aperiodic task will have begun a quantum of
processing just before 0 and will process for the length of a quantum. In order for the deadline to be
met, we must have $q + H + C + \lceil C/q \rceil H \le T$. Note that even though there is a single task, it must be
executed in a quantum fashion and incur the extra overhead. Consequently, we approximate the
situation by

$q + H + C(1 + H/q) \le T$, or $U(1 + H/q) \le 1 - (q + H)/T$, or $U \le q/(q + H) - q/T$.

We wish to choose q in order to make the bound $q/(q + H) - q/T$ as large as possible. A
straightforward calculation shows that the optimal q is given by $q = (HT)^{1/2} - H$, where we assume
that T > H. The resulting bound on utilization is given by $U \le (1 - (H/T)^{1/2})^2$.

If U satisfies this bound, then there is a range of choices of quanta which can be used, and we might
pick the value of q which minimizes overhead. Thus for given H, T and U satisfying
$U \le (1 - (H/T)^{1/2})^2$, we might pick the overhead minimization quantum size,

$q = \left( T(1 - U) - H + ((T(1 - U) - H)^2 - 4UTH)^{1/2} \right)/2.$

Note we assume U, T and H are known and
that the scheduling algorithm can use these quantities to establish the quantum size.
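The single-task results can be evaluated directly. The sketch below uses function names of our own and the illustrative values H = 1 and T = 100, which are assumptions rather than figures from the analysis. The overhead minimization quantum is the larger root of $q^2 - (T(1 - U) - H)q + UTH = 0$, obtained by setting $U = q/(q + H) - q/T$.

```python
import math


def utilization_bound(q: float, H: float, T: float) -> float:
    """Largest schedulable U for quantum q: q/(q + H) - q/T."""
    return q / (q + H) - q / T


def optimal_quantum(H: float, T: float) -> float:
    """Quantum maximizing the bound: sqrt(H*T) - H (requires T > H)."""
    return math.sqrt(H * T) - H


def max_utilization(H: float, T: float) -> float:
    """Value of the bound at the optimal quantum: (1 - sqrt(H/T))^2."""
    return (1 - math.sqrt(H / T)) ** 2


def overhead_minimizing_quantum(U: float, H: float, T: float) -> float:
    """Largest q that still satisfies U <= q/(q + H) - q/T."""
    if U > max_utilization(H, T):
        raise ValueError("U exceeds the fixed quantum bound; no feasible q")
    a = T * (1 - U) - H
    disc = max(a * a - 4 * U * T * H, 0.0)  # clamp rounding noise at the bound
    return (a + math.sqrt(disc)) / 2


if __name__ == "__main__":
    H, T = 1.0, 100.0
    q_star = optimal_quantum(H, T)
    print(f"optimal q = {q_star:.3f}, bound = {max_utilization(H, T):.4f}")
    q_big = overhead_minimizing_quantum(0.5, H, T)
    print(f"for U = 0.5: largest feasible q = {q_big:.3f}, "
          f"bound there = {utilization_bound(q_big, H, T):.4f}")
```

A task with utilization well below the maximum can therefore run with a much longer quantum than the one that maximizes the bound, which reduces the number of overhead periods it incurs.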

The above analysis applies to either a rate monotonic fixed priority algorithm or a deadline driven
algorithm. We now extend to the multiple task case. The rate monotonic formulation is the easiest to
carry out. Suppose that we are given n tasks and a critical instant. We consider the subset of tasks
consisting of the first i tasks, those having highest priority. Each task has its utilization inflated by the
factor 1 + H/q, and this subset may be delayed in processing H + q units due to a lower priority task
beginning processing just before the critical instant. In the Liu and Layland formulation, there is a
bound on the worst case rate monotonic utilization given by $R_i = i(2^{1/i} - 1)$. In order to create the
worst case full utilization of the critical zone, we can reduce the processing time of the $i$th task by H +
q units. This means that the worst case will be given by $C_j = (T_{j+1} - T_j)q/(q + H)$, $1 \le j \le i - 1$, and
$C_i = (2T_1 - T_i - (q + H))q/(q + H)$, where we now assume $C_i > 0$. This gives

$V_i = \left[ \sum_{j=1}^{i-1} (T_{j+1} - T_j)/T_j + (2T_1 - T_i - (q + H))/T_i \right] q/(q + H).$

We  wish  to  find  the  T’s  which  minimize  the  above. 

By  repeating  the  Liu  and  Layland  argument,  we  find 

$V_i \le R_i q/(q + H) - q/T_i$, $1 \le i \le n$.

This says that for a quantum rate monotonic scheduling algorithm, the worst case bounds on a set
of i tasks are given by

$V_i = R_i q/(q + H) - q/T_i$, $1 \le i \le n$.

This set of criteria illustrates the quantum trade-off. We wish to choose q to make $R_i q/(q + H) - q/T_i$
as large as possible. Increasing q tends to make $R_i q/(q + H)$ near the Liu and Layland rate monotonic
bound of $R_i$. Unfortunately, increasing q increases the quantum loss factor $q/T_i$, which serves to
reduce the bound.

To better understand the situation, we rewrite the bound as $R_i X/(X + F) - X$, where $X = q/T_i$ and
$F = H/T_i$. We assume that X > F. It follows that the optimal choice for X is given by $X = (R_i F)^{1/2} - F$.
The resulting bound on $V_i$ is given by $(R_i^{1/2} - F^{1/2})^2$. It follows that all deadlines for the $i$th
class will be made if q is taken to be $T_i((R_i F)^{1/2} - F)$, and the cumulative utilization is taken to be no
greater than $(R_i^{1/2} - F^{1/2})^2$. If all deadlines are to be met, then a quantum q must be found which
simultaneously satisfies $V_i \le R_i q/(q + H) - q/T_i$. This cannot be pursued until a specific value for H
and typical values for the $T_i$ are introduced. After this is done, one can determine specific bounds on
processor schedulability. Furthermore, one should use an H which depends on the number of tasks
being processed.
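Once specific values are chosen, the simultaneous condition is easy to test for a candidate quantum. The sketch below is our own illustration; the task set, the overhead H = 0.1 and the function names are assumptions made for the example, and the check is only the sufficient condition stated above, not an exact schedulability test.

```python
import math


def rm_bound(i: int) -> float:
    """Liu and Layland bound R_i = i(2^(1/i) - 1)."""
    return i * (2.0 ** (1.0 / i) - 1.0)


def quantum_is_feasible(tasks, q: float, H: float) -> bool:
    """Check V_i <= R_i q/(q + H) - q/T_i for every i.

    tasks: list of (C, T) pairs sorted by increasing period T, where C is
    the processing requirement before any overhead inflation.
    """
    V = 0.0
    for i, (C, T) in enumerate(tasks, start=1):
        V += C / T
        if V > rm_bound(i) * q / (q + H) - q / T:
            return False
    return True


def best_quantum_for_class(i: int, T_i: float, H: float):
    """Return (q, bound) maximizing the bound for the i-th class alone:
    q = T_i(sqrt(R_i F) - F) and bound = (sqrt(R_i) - sqrt(F))^2, F = H/T_i."""
    R, F = rm_bound(i), H / T_i
    q = T_i * (math.sqrt(R * F) - F)
    return q, (math.sqrt(R) - math.sqrt(F)) ** 2


if __name__ == "__main__":
    tasks = [(1.0, 10.0), (2.0, 20.0), (3.0, 40.0)]  # hypothetical task set
    H = 0.1
    for q in (0.5, 1.0, 5.0, 9.5):
        print(f"q={q:4.1f} feasible: {quantum_is_feasible(tasks, q, H)}")
    print(best_quantum_for_class(1, 10.0, H))
```

Because the best quantum differs from class to class, a practical choice would search the values of q for which `quantum_is_feasible` holds rather than use any single class optimum.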

The quantum formulation for deadline scheduling is similar. We outline the mathematical details.
The utilization of the task set, $V_n$, is inflated by the factor (q + H)/q, arising from the overhead.
Furthermore, there is a potential priority inversion factor q + H which arises when a higher priority
task is held out during the course of a quantum of lower priority processing. This serves to reduce the
period by q + H. As a consequence, we require in the worst case $V_n \le q/(q + H) - q/T_1$, where $T_1$
is the shortest period. One can optimize on q as before to find $q = (H T_1)^{1/2} - H$ and
$V_n = (1 - (H/T_1)^{1/2})^2$. This result is equivalent to the rate monotonic result; however, $R_i$ is replaced by 1.

The above analysis is somewhat approximate. We have assumed that each task requires an integer
number of quanta and that H is independent of the task set. Each of these assumptions can be
modified. As we continue to study scheduling, we are able to introduce specific assumptions which
will allow us to derive sharper bounds on processor schedulability. In addition, we mention below the
possibility of a variable quantum formulation which could substantially increase the schedulability.

4  Variable  Quantum  Formulation 

The  previous  section  presented  a  worst  case  analysis  for  scheduling  with  hard  deadlines  when 
overhead  is  present  using  a  quantum  formulation.  The  bounds  can  be  substantially  below  those  of 
the no overhead case. The rigidity of the fixed quantum formulation can lead to serious inefficiencies
in scheduling. In this section, we introduce the idea of a variable length quantum. This approach
seems  to  have  much  better  scheduling  power.  A  formal  analysis  will  be  carried  out  as  a  future  task. 

A variable quantum algorithm would operate in the following manner. For each task, the next
initiation time, the next deadline, the task period and the amount of computation required would be
kept in a database. At a decision point, the scheduling algorithm considers all ready tasks and will
execute the one with the closest deadline. In addition, the scheduler will set a timer to indicate the
time at which the next scheduling decision must be made. This time will be the end of the current
processing requirement or the initiation time of a task which is not yet ready but will become higher
priority when it is ready. The current task is then allowed to execute until the timer signals a change is
needed. This algorithm creates a deadline scheduler. It does generate some overhead, namely the
overhead to carry out the computations themselves as well as the task swapping times. This overhead
will, however, be considerably less than the overhead created by the fixed quantum scheduler.
The great benefit occurs because a task need not be interrupted just to enforce a fixed quantum
algorithm. Rather, any task can execute continuously if there are no required scheduling changes.
For example, if we have a single task, then for the deadline to be met, we need $C + H \le T$ or
$U \le 1 - (H/T)$, a bound better than that given earlier for the fixed quantum formulation.
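The decision step described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the project's scheduler: the `Task` record and the `decide` function are hypothetical names, each deadline is taken to fall one period after the corresponding initiation time, and overhead is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    period: float
    next_release: float  # next initiation time
    remaining: float     # computation left in the current instance (0 if not ready)
    deadline: float      # deadline of the current instance (used only when ready)


def decide(tasks, now):
    """Return (task to run, time of the next scheduling decision).

    The ready task with the closest deadline runs. The timer is set to the
    end of its processing requirement, or to the earliest initiation time of
    a task that is not yet ready but would then have a closer deadline.
    """
    ready = [t for t in tasks if t.remaining > 0]
    if not ready:
        # Nothing to run: wake up at the next initiation time, if any.
        wake = min((t.next_release for t in tasks), default=None)
        return None, wake
    current = min(ready, key=lambda t: t.deadline)
    timer = now + current.remaining
    for t in tasks:
        if t.remaining == 0 and now < t.next_release < timer:
            # Assumption: the new instance's deadline is one period after release.
            if t.next_release + t.period < current.deadline:
                timer = t.next_release
    return current, timer


if __name__ == "__main__":
    tasks = [
        Task("A", period=10, next_release=10, remaining=4, deadline=10),
        Task("B", period=3, next_release=2, remaining=0, deadline=0),
        Task("C", period=50, next_release=1, remaining=0, deadline=0),
    ]
    chosen, timer = decide(tasks, now=0)
    print(chosen.name, timer)
```

In this example task A runs, and the timer is set to time 2 because B will then have a deadline at 5, ahead of A's deadline at 10. The arrival of C at time 1 does not require a decision, since its deadline would be later than A's.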

It may be desirable to limit the length of the processing times on a single task, for example to
provide  timely  handling  of  queued  interrupts.  This  and  other  generalizations  are  the  subject  of  future 
work. 




References

[Liu 73]  Liu, C. L. and Layland, J. W.
Scheduling Algorithms for Multiprogramming in a Hard Real Time Environment.
JACM, 1973.




4.7  Scheduling  Hard  Deadline  Periodic  Tasks  with  I/O 




Scheduling  Hard  Deadline 
Periodic  Tasks  with  I/O 


E.  Douglas  Jensen,  John  P.  Lehoczky 


Lui  Sha,  Samuel  Shipman  and  Martin  McKendry 




1. Introduction


1.1 Background

The  current  practice  in  designing  systems  for  a  real-time  environment  is  rather  ad  hoc.  Performance 
evaluation  is  accomplished  either  by  simulation  or  by  actual  measurements  on  prototype  systems, 
and  system  performance  is  improved  by  "fine  tuning"  certain  system  parameters.  This  hand-crafted 
scheduling  approach  often  yields  systems  that  are  very  difficult  to  understand,  hard  to  maintain  and 
hard to upgrade. Nevertheless, this approach has been taken in part because the theoretical
foundations necessary to do better have not been established. The purpose of this work is to advance the
understanding  of  task  scheduling  beyond  the  current  uniprocessor  stage  to  the  consideration  of 
network  scheduling.  It  is  intended  to  provide  a  foundation  upon  which  high  performance  schedulers 
can  be  developed.  This  is  an  ambitious  goal,  especially  for  a  distributed  system.  Such  systems  are 
composed  of  many  processors,  running  many  tasks  distributed  across  those  processors  and  are 
connected  by  a  communication  network.  The  scheduling  involves  meeting  task  timing  requirements, 
and  this  entails  scheduling  the  processors  and  the  communication  network.  One  must  also  integrate 
these  two  scheduling  activities. 

There  is  some  literature  on  scheduling  real-time  tasks;  however,  all  of  those  papers  are  narrowly 
focused  on  the  scheduling  of  tasks  in  processors.  None  makes  any  mention  of  the  importance  of 
considering  network  scheduling.  Consequently,  these  papers  do  not  recognize  that  tasks  do  not  run 
in  isolation  but  receive  data  as  input  from  other  tasks  (possibly  over  the  network)  and  output  data  to 
other  tasks  (also  possibly  over  the  network).  This  suggests  that  we  must  begin  with  a  broader  view  of 
the scheduling problem. Rather than focusing narrowly on the single programs resident in a
processor, we must consider the input and output aspects of those tasks as well. This report documents our
initial  research  on  the  network  scheduling  problem. 

We  begin  with  a  short  introduction  to  the  literature  on  the  processor  scheduling  of  periodic 
processes  with  hard  deadlines.  This  is  followed  by  a  description  of  the  research  project.  We  follow 
this with a description of the emulator and simulator software that were built as tools to facilitate our
study  of  network  scheduling.  The  source  code  for  the  simulator  and  the  emulator  is  supplied  in  a 
separate  binder. 


We next present the results of this year's research and its implications for distributed system
schedulers.




1.1.1  Scheduling  Processors 

There is some research literature on scheduling periodic tasks in processors. Liu and
Layland [Liu 73] considered the case of scheduling a single processor. Under the assumption that the
deadline for a task is the same as the period of the task, the optimal fixed priority and dynamic
priority scheduling algorithms were determined. These optimal algorithms were shown to be the rate
monotonic and dynamic deadline scheduling algorithms respectively. In this case, the word
"optimal" means that if a task set can be scheduled by any fixed (dynamic) priority algorithm, then it
can also be scheduled by the rate monotonic (dynamic deadline) scheduling algorithm. Liu and
Layland derived the worst case scheduling bounds for these two algorithms. The worst case bound
for the rate monotonic algorithm was shown to be ln 2 (approximately 0.693). This means that any set
of periodic tasks having processor utilization less than 0.693 can be scheduled by the rate monotonic
algorithm so that no deadlines are missed. This algorithm assigns priorities in inverse order of the
periods: the shorter the period, the higher the priority. The worst case bound for the dynamic
deadline algorithm was found to be 1.0. This means that the dynamic deadline algorithm (which gives
highest priority to the task having the nearest deadline) can schedule any set of periodic tasks which
do not over-utilize the processor's capacity.
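The two bounds can be checked numerically. The following is a minimal sketch, assuming the
standard form of the Liu and Layland bound for n tasks, n(2^(1/n) - 1), which decreases toward ln 2
as n grows. The function names and the sample task set are illustrative and do not come from the
report.

```python
import math

def rm_bound(n: int) -> float:
    """Worst case utilization bound for n periodic tasks under rate monotonic
    scheduling (standard Liu and Layland form; tends to ln 2 as n grows)."""
    return n * (2 ** (1.0 / n) - 1)

def utilization(tasks) -> float:
    """tasks: iterable of (period, computation_time) pairs."""
    return sum(c / p for p, c in tasks)

def passes_rm_bound(tasks) -> bool:
    """Sufficient (not necessary) test for rate monotonic schedulability."""
    tasks = list(tasks)
    return utilization(tasks) <= rm_bound(len(tasks))

def passes_deadline_bound(tasks) -> bool:
    """Dynamic deadline scheduling: utilization must not exceed 1.0."""
    return utilization(tasks) <= 1.0

# Illustrative task set: periods 20 and 40, computation times 5 and 8.
sample = [(20, 5), (40, 8)]
print(utilization(sample), rm_bound(len(sample)), math.log(2))
```

A task set that fails the rate monotonic test may still be schedulable by that algorithm, since the
bound describes only the worst case.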

This work has been extended in a variety of ways. For example, Leung and Merrill [Leung 80] and
Lawler  [Lawler  81]  considered  deadline  scheduling  on  multiple  processors,  while  Martel  [Martel 
82]  extended  Lawler's  work  to  introduce  release  times  and  due  times.  Mok  [Mok  83]  proved  that  the 
dynamic  deadline  scheduling  algorithm  remains  optimal  with  respect  to  a  mixed  set  of  periodic  and 
sporadic  tasks  with  hard  deadlines.  A  sporadic  task  has  a  random  arrival  time,  a  computation  time 
and  a  hard  deadline.  Mok  assumed  a  minimum  separation  time  between  two  consecutive  arrivals  of 
sporadic  tasks.  Mok  showed  that  the  least  slack  time  algorithm  is  also  optimal.  The  slack  time  is  the 
difference  between  the  task  deadline  and  the  end  of  the  task  computation  time. 


1.1.2  I/O  Scheduling 

In contrast to the rather well developed theory of scheduling periodic tasks in processors, little if
any work has appeared on the important problem of integrating the scheduling of tasks in processors
with their data I/O. Given that distributed command and control systems are characterized by data
flows from node to node with hard real-time constraints, it is imperative to develop a theory and to
provide  algorithms  for  hard  real-time  task  scheduling  with  I/O. 


The  joint  processor  and  I/O  scheduling  problem  is  especially  difficult.  In  order  to  properly  address 
it,  we  have  taken  a  three  pronged  approach.  This  entails 

•  Continuing to extend the Liu and Layland theory to this more general setting,

•  Developing a system emulator with which we can perform actual experiments to
confirm the theory and observe the impact of effects which are typically ignored in
analytic studies, and

•  Developing  a  system  simulator  which  will  allow  us  to  perform  a  large  number  of  test  runs 
against  which  to  compare  the  emulator  and  the  theory. 

The  three  approaches  offer  three  distinct  but  related  ways  to  study  the  I/O  scheduling  problem. 
The  analytic  approach  has  the  potential  of  giving  general  expressions  for  scheduling  bounds  over  a 
wide  range  of  cases;  however,  it  necessarily  entails  making  simplifying  assumptions  about  the  system. 
The  simulator  offers  the  possibility  of  a  more  realistic  system  model  against  which  to  compare  the 
accuracy of the theory. The emulator offers the most realistic system view and by its nature
incorporates system effects that are not included in either the analytic models or the simulator. On the
other  hand,  the  results  of  the  emulator  may  be  influenced  by  the  specific  hardware  and  software 
characteristics.  By  having  these  three  approaches,  we  can  better  evaluate  the  importance  of  specific 
system  effects  while  producing  theoretical  results  at  the  same  time. 

1.2  Overview of Progress

In  this  section,  we  briefly  summarize  the  progress  to  date. 


1.2.1  Progress  of  Theory  Approach 

It  was  mentioned  in  the  introduction  that  the  original  basis  for  scheduling  periodic  tasks  with  hard 
deadlines  in  processors  was  developed  by  Liu  and  Layland  [Liu  73].  This  paper  contains  the  results 
mentioned  earlier  on  the  optimality  of  and  worst  case  bounds  for  the  rate  monotonic  and  dynamic 
deadline  scheduling  algorithms.  In  fact,  this  paper  also  contains  two  other  results  of  even  greater 
importance,  because  they  form  the  entire  basis  for  the  worst  case  analysis.  These  results  can  be 
called the "hyperperiod" result and the "critical zone" result. The hyperperiod result offers the
observation that for periodic processes, the pattern of initiation of each task is periodic. If one
considers the least common multiple of all the periods (the hyperperiod), then the pattern of task
initiations is repeated identically from one hyperperiod to the next. This leads one to the conclusion
that one needs only to analyze the timing constraints over a single hyperperiod to determine if a
particular algorithm can meet all the deadlines infinitely far in the future.
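As a small illustration of the hyperperiod, the sketch below computes the least common multiple of
a set of integer periods and lists the initiation times of each task within one hyperperiod. The
helper names are hypothetical, and integer periods are an assumption of the sketch.

```python
from functools import reduce
from math import gcd

def hyperperiod(periods):
    """Least common multiple of all the (integer) task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)

def releases_in_hyperperiod(periods):
    """Initiation times of each task within one hyperperiod, keyed by period.
    The same pattern repeats, shifted, in every later hyperperiod."""
    h = hyperperiod(periods)
    return {p: list(range(0, h, p)) for p in periods}

print(hyperperiod([20, 40]))              # periods used in the simulator example
print(releases_in_hyperperiod([5, 15]))   # periods used in the halting example
```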


The "critical zone" result was used by Liu and Layland to determine if a fixed priority algorithm
(specifically the rate monotonic algorithm) could meet all the deadlines. It indicates that to determine
if a task will meet all its deadlines, one need only consider the first deadline of that task for the worst
phasing of tasks. If this deadline is met when all tasks are initiated at a common time 0, then all
of them will be met.




It  is  important  to  note  that  both  of  these  fundamental  results  no  longer  hold  when  one  extends 
attention  to  the  I/O  scheduling  case.  The  problem  is  especially  acute  when  one  allows  for  the 
postponement  of  deadlines,  an  important  consideration  which  is  realistic  when  one  has  additional 
buffers to store tasks. One of the initial findings was that one may have to consider many
hyperperiods before a firm conclusion could be drawn about whether all deadlines would be met or some
would be missed in the future. Furthermore, the ability of a task to meet the initial deadline did not
allow one to conclude that it would meet all future deadlines, even in the static priority case when
either I/O scheduling or deadline postponement is used. These important findings indicated that we
needed to essentially abandon the Liu and Layland theory and develop a new one. The hyperperiod
result was especially unsettling, because it had dramatic impact on the simulator and emulator.
Simply stated, one generates a task set for the simulator or emulator, selects a scheduling algorithm
and runs the tasks noting whether deadlines are made or missed. It is critical to know how long either
must  be  run  before  a  conclusion  can  be  drawn.  Clearly,  the  experiment  can  be  stopped  when  a 
deadline  is  missed,  but  what  if  no  deadline  is  missed?  How  long  must  one  continue  the  run?  It 
became  important  to  settle  this  issue.  Of  special  note  was  the  development  of  a  criterion  to  determine 
whether  deadlines  would  be  missed  at  some  time  in  the  future  based  on  the  behavior  during  an  initial 
time  segment.  The  theoretical  research  is  discussed  in  greater  detail  in  Chapter  2. 

1.2.2  Progress on the Simulator

A  simulator  was  built  to  allow  for  the  study  of  factors  usually  ignored  in  analytic  models  and  for  the 
convenient  study  of  a  large  number  of  cases.  The  simulator  was  designed  to  identify  a  threshold  at 
which  some  task  deadline  would  be  missed.  One  begins  with  an  initial  task  set  defined  in  terms  of  the 
number of tasks, the period of each, and relative values for input DMA, processor computation and
output DMA times. The simulator systematically scales up the utilizations until a task set is reached
which cannot be scheduled by the particular algorithms being used. The simulator is able to handle
all  the  scheduling  algorithms  described  earlier.  Furthermore,  it  allows  for  postponement  of  deadlines. 
The  simulator  also  has  a  useful  plotting  and  analysis  facility  to  aid  the  user's  understanding  of  the 
complex  interactions  that  arise  in  I/O  scheduling.  The  simulator  is  described  in  greater  detail  in 
Chapter  3. 

1.2.3  Progress on the Emulator

An emulator was built to allow for the study of the behavior of the scheduling model described in
Chapter 2 under realistic conditions in an actual computer system. It allows totally automated
performance of sets of experiments, and provides complete traces of the significant events that occur
during the experiment which, when analyzed by automated analysis tools, yield precise timings of the
events that determine the results of the scheduling algorithms in use. The structure of the
emulator allows different scheduling policies to be added easily. The emulator is described in detail
in Chapter 4. Initial implementations of the rate monotonic and deadline scheduling algorithms exist.
However, more extensive validation will be conducted next year. Some emulator results have already
been used to demonstrate scheduling phenomena that occur in real systems, for example, priority
inversion.




2.  Theoretical  Investigation 

2.1  The  Model 

As we discussed in the introduction, this project utilizes a three pronged approach to the I/O
scheduling problem. The accomplishments during this initial phase (April 1985 - December 1985)
were of two types. First, we have made major developments in each of the three separate approach
areas. These will be documented shortly. Second, we have made use of the software tools to run a
pilot experiment on I/O scheduling. The conclusions drawn from this pilot study, while quite
preliminary, are also quite clear. They lead us to see the vital importance of I/O scheduling. We will offer
more specific conclusions shortly.

The  initial  phase  of  the  research  was  devoted  to  the  development  of  an  appropriate  model  with 
which  to  consider  I/O  scheduling.  This  depends  critically  on  the  specific  hardware  architecture 
being  considered.  We  selected  a  simple  model  which  ignored  a  number  of  potential  problems  leaving 
enhancements  to  future  efforts.  The  model  has  three  parts: 

1.  Input DMA

2.  Computation 

3.  Output  DMA 

We  assume  that  a  task  has  a  period,  and  that  the  deadline  is  at  the  end  of  the  period.  This  deadline 
can  be  extended  by  some  integer  number  of  periods  if  we  consider  the  postponed  deadline  model. 
The postponement corresponds to the addition of buffers. In this year's work, we considered only two
cases: no postponement and a postponement of one period. From the initiation time to the deadline,
three things must be accomplished sequentially: the data associated with the task must be brought in
by DMA, the computation must be carried out, and the data from the computation must be output by
DMA. These three subtasks are treated as sequential and noninterfering activities. It is assumed that
there is one input channel through which data can be brought into the processor and one channel
through which data can be taken out. Furthermore these are assumed to be independent channels
and do not steal any CPU cycles. This provides a good starting model and is a reasonable
representation of some architectures. From the scheduling point of view, there are three places where
scheduling can take place: One can schedule the input DMA phase, giving priority to whatever
transfer is ready to be carried out, and one can similarly schedule the output DMA phase. Finally,
there is a processor scheduling phase in which the processor can give priority to any task which is
currently ready to run.




2.1.1  I/O Scheduling Model

In this study, we assume that each periodic task, T_i, has a fixed period P_i. Associated with each
period, there are the input data needed for the computation, I_i, the computation, C_i, and the data output
from the computation, O_i. In other words, the scheduling of a task includes three phases: to schedule
the DMA process which transfers the data from an input channel to the main memory, to schedule the
CPU to perform the computation and to schedule the output DMA process which transfers the results
of a computation to an output device. The deadline of a task initiated at period i can be designated as
the beginning of the next period, (i + 1). In this case, within the period, (i, i + 1), a task must complete all
three phases of activities: I_i, C_i and O_i. This type of deadline designation represents certain demanding
tasks where delays must be minimized as much as possible. On the other hand, the deadline of a task
initiated at period i can be postponed to the beginning of period (i + 2) or even later. By postponing
deadlines, it becomes possible to input the data at period i and perform the computation and data
output at period (i + 1) or later. Deadline postponement improves the schedulability at the cost of
increased buffer requirement and phase delay. If we postpone the deadline to infinity, then the
scheduling problem becomes a three stage queueing network with infinite buffer space.
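The task model and the effect of postponement on deadlines can be written down compactly. The
sketch below is illustrative only: the class and field names are hypothetical, times are taken to
be integers, and a task is assumed to be first initiated at time 0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    period: int    # P_i
    dma_in: int    # I_i, input DMA time per period
    compute: int   # C_i, processor time per period
    dma_out: int   # O_i, output DMA time per period

    def release(self, k: int) -> int:
        """Initiation time of the k-th instance (k = 0, 1, 2, ...)."""
        return k * self.period

    def deadline(self, k: int, postpone: int = 0) -> int:
        """Deadline of the k-th instance: the start of the next period,
        pushed back by `postpone` whole periods."""
        return (k + 1 + postpone) * self.period

def stage_loads(tasks):
    """Utilization of the input DMA, processor and output DMA stages."""
    return (sum(t.dma_in / t.period for t in tasks),
            sum(t.compute / t.period for t in tasks),
            sum(t.dma_out / t.period for t in tasks))

task_set = [Task(20, 5, 5, 5), Task(40, 8, 8, 8)]
print(stage_loads(task_set))
```

With no postponement the instance initiated at time 0 of the first task is due at time 20; with a
postponement of one period it is due at time 40.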

In this study, we assume that for each task T_i, the period P_i, the input data I_i, the computation C_i and
the output data O_i are deterministic. In addition, we assume that the effect of "cycle stealing" during
DMA is negligible and there is no contention between input channels and output channels. In other
words, we assume that we have high performance computers which have dual-port memory banks
and separate input and output channel controllers. These assumptions will be relaxed in the next
year's study.

2.1.2  Scheduling  Algorithms 

We next discuss the selection of scheduling algorithms. There are fundamentally two major classes
of algorithms and subclasses within. The two major classes are

1.  The algorithms which treat each of the three phases of scheduling independently, and

2.  The  algorithms  which  integrate  all  three  scheduling  activities  together. 

2.1.2.1  Nonintegrated Algorithms

The first class of algorithms essentially generates a separate scheduling algorithm for each of the
three phases. We have chosen to consider three possible choices for each of these single phase
schedulers:

1.  FIFO,

2.  Rate monotonic, and

3.  Deadline.

The  FIFO  algorithm  is  a  model  of  no  scheduling  algorithm  at  all.  It  merely  schedules  the  tasks  in  the 
arrival order. This algorithm will be seen to be very unstable. It is possible that the tasks will arrive in
the  order  which  is  consistent  with  a  good  scheduling  algorithm  and  so  be  scheduled  correctly  by 
FIFO.  It  is  equally  possible  that  the  tasks  will  arrive  in  an  order  which  will  lead  to  poor  scheduling 
decisions.  Consequently,  we  will  see  that  FIFO  will  have  a  wide  range  of  performance,  but  is  typically 
inferior  to  an  algorithm  which  is  designed  to  meet  deadlines  at  high  utilization  levels. 

The rate monotonic algorithm was shown by Liu and Layland to be the optimal fixed priority
algorithm. It is very easy to implement in the processor, and can be implemented for input and output
DMA processes with some modifications to the channel controllers. Specifically, the scheduler gives
higher priority to task i over task j if task i has a shorter period than task j. We will see that its worst
case utilization bound (0.693) is a large underestimate of its typical potential.

The  dynamic  deadline  algorithm  was  shown  by  Liu  and  Layland  to  be  the  optimal  dynamic  priority 
algorithm. Specifically this algorithm gives the highest priority to the task having the nearest
deadline. This can be separately implemented at each of the three stages of scheduling.
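The three single phase policies differ only in the rule used to pick the next ready job. A minimal
sketch of the three rules follows. The tuple layout (task name, period, absolute deadline) and the
function names are assumptions made for illustration.

```python
def fifo_pick(ready):
    """FIFO: `ready` is kept in arrival order, so take the first job."""
    return ready[0][0]

def rate_monotonic_pick(ready):
    """Rate monotonic: the job whose task has the shortest period wins."""
    return min(ready, key=lambda job: job[1])[0]

def dynamic_deadline_pick(ready):
    """Dynamic deadline: the job with the nearest absolute deadline wins."""
    return min(ready, key=lambda job: job[2])[0]

# Two jobs ready at time 42: A (period 50, initiated at 0, due at 50)
# arrived first; B (period 20, initiated at 40, due at 60) arrived later.
ready = [("A", 50, 50), ("B", 20, 60)]
print(fifo_pick(ready), rate_monotonic_pick(ready), dynamic_deadline_pick(ready))
```

In this situation rate monotonic selects B because of its shorter period, while FIFO and dynamic
deadline both select A, for different reasons.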

2.1.2.2  Integrated Algorithms

This study included one algorithm from the second group consisting of the algorithms which
integrate all three stages of scheduling. We considered the propagated deadline algorithm. This
algorithm works in the following way. Suppose a task T_i has period P_i and requires I_i units of DMA
input time, C_i units of processor computation time, and O_i units of DMA output time. Suppose that
this task is initiated at time t = 0. At the first period, the propagated deadline algorithm treats this task
as having a deadline given by P_i - O_i - C_i for purposes of DMA input scheduling, a deadline of
P_i - O_i for purposes of processor scheduling, and a deadline P_i for purposes of DMA output
scheduling. This algorithm is similar to the dynamic deadline algorithm applied at each stage, but it
takes later computation requirements into account in earlier scheduling decisions. This algorithm is
an improvement over three independent scheduling algorithms, but we will later see that the increase
in complexity may not be worth the slight increase in scheduling potential.
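The per-stage deadlines can be computed directly from the task parameters. The sketch below
follows the description above for a task initiated at time 0 with no postponement. The `release`
and `postpone` arguments generalize it to later instances and postponed deadlines, which is an
assumption of this sketch rather than something stated in the text.

```python
def propagated_deadlines(period, compute, dma_out, release=0, postpone=0):
    """Stage deadlines under the propagated deadline rule.

    The final deadline is the end of the period (pushed back by `postpone`
    periods). Each earlier stage must finish early enough to leave room
    for the stages that follow it."""
    final = release + (1 + postpone) * period
    return {
        "dma_in": final - dma_out - compute,   # P - O - C
        "cpu": final - dma_out,                # P - O
        "dma_out": final,                      # P
    }

# Task with P = 20, C = 5, O = 5, initiated at time 0.
print(propagated_deadlines(period=20, compute=5, dma_out=5))
```

Each stage then applies the nearest-deadline rule to its own propagated deadline instead of the
common end-of-period deadline.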


2.2  Methodology 


2.2.1  Algorithm  Selection 

We briefly discussed the selection of candidate algorithms earlier in this chapter. There, the distinction was
made between algorithms which treated each of the three phases of scheduling in isolation from the
others and algorithms which used an integrated view. We chose four representatives from the first
category (FFF, FRF, RRR, and DDD) and one (PPP) from the second category. Within the first
category, the four were chosen to give representation to algorithms which exhibit a lack of
systematic use of scheduling theory and to those which make use of concepts of real-time scheduling.
The latter consist only of simple and effective linear algorithms, since complex algorithms are
self-defeating in the context of hard real-time scheduling.

In  selecting  an  unsophisticated  algorithm,  we  need  a  model  for  the  practice  of  I/O  scheduling 
commonly seen in many systems. In most existing channel processors, the priority of a channel,
once  assigned,  remains  unchanged  during  run-time.  In  addition,  all  data  I/O  to  or  from  a  channel 
contends  at  the  channel's  priority,  independent  of  the  nature  of  the  tasks.  This  practice  can  create 
great  difficulties  for  distributed  systems.  For  example,  the  system  database  is  shared  by  many  parts  of 
the  various  subsystems.  However,  all  the  data  provided  by  the  database  system  must  contend  at  the 
same  priority  assigned  to  the  disk  channel,  independent  of  the  urgency  of  the  requests.  The  general 
practice  of  assigning  channel  processor  priorities  is  to  do  it  according  to  the  average  speed  of  the 
devices  it  connects.  In  addition,  once  a  channel  starts  a  DMA  operation,  it  is  not  preemptable.  This 
type  of  scheduling  policy  is  approximated  by  the  FIFO  scheduling  algorithm. 

Since most existing computers are able to support the rate monotonic scheduling algorithm for CPU
scheduling, the set of algorithms: FIFO for input DMA, Rate Monotonic for CPU and FIFO for output
DMA (FRF algorithm) is designed to measure the effect of using a good CPU scheduling algorithm but
poor I/O scheduling algorithms. The three stage FIFO (FFF) is selected to model the effect of a
complete lack of sound real time scheduling algorithms. Since the rate monotonic scheduling
algorithm and the deadline algorithm represent the optimal static and dynamic scheduling algorithms for
single stage scheduling, it is logical to use them at each stage of scheduling. Thus, two candidates for
task scheduling with I/O are the three stage Rate Monotonic algorithm (RRR) and the three stage
Dynamic Deadline scheduling algorithm (DDD).

In  this  study,  we  introduce  a  refined  version  of  the  earliest  deadline  scheduling  algorithm  called  the 
propagated  deadline  scheduling  algorithm.  This  algorithm  was  described  in  detail  earlier  and  is 
designed  to  provide  an  integrated  approach  to  scheduling. 




2.2.2  The  Halting  Theorem 

In either the simulator or the emulator, a very important problem is the halting problem. That is,
when a deadline is missed for a particular task set, it is a certain event. However, when no deadline is
missed by a given time in the simulation/emulation process, it is unclear if all the deadlines will
continue to be met in the future. In a pure processor scheduling formulation, one needs to trace no
more than a single hyperperiod (the least common multiple of all the periods). However, this result
no longer holds when either I/O scheduling or deadline postponement is introduced. Consider the
case of pure processor scheduling with a single deadline postponement. Suppose that tasks T_1 and
T_2 have periods 5 and 15 and that the processor utilizations of the two tasks are 1/5 and 13/15
respectively. In addition, let the scheduling algorithm be the rate monotonic scheduling algorithm. In
this case, the hyperperiod is 15 and the total utilization is 16/15 > 1. It is obvious that a deadline will
be missed, because there is 1 unit of extra work for each hyperperiod. However, no deadline is
missed until 15 hyperperiods have passed. Since the number of hyperperiods needed to detect a
missed deadline could be arbitrarily large, it is important to determine criteria to decide whether all
deadlines will be met without carrying on the simulation/emulation process indefinitely into the
future.
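A slot-by-slot replay makes this example concrete. The sketch below is a hypothetical helper, not
the simulator described in this report. It assumes unit time slots, preemptive fixed priorities in
list order (shortest period first), instances of the same task served in arrival order, and a
deadline of (1 + postpone) periods after initiation. Under these assumptions the longer task first
falls behind its postponed deadline at time 210, the boundary between the 14th and 15th
hyperperiods; the exact count depends on how slots and boundaries are counted.

```python
from functools import reduce
from math import gcd

def first_missed_deadline(tasks, postpone=1, max_hyperperiods=50):
    """Replay fixed priority scheduling and report the first missed deadline.

    tasks: list of (period, compute) pairs, highest priority first.
    Returns (task index, initiation time, deadline) or None if no deadline
    is missed within max_hyperperiods."""
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for p, _ in tasks))
    queues = [[] for _ in tasks]   # per task: [remaining, deadline, initiation]
    for t in range(max_hyperperiods * hyper):
        # Initiate new instances at the start of each period.
        for i, (period, compute) in enumerate(tasks):
            if t % period == 0:
                queues[i].append([compute, t + (1 + postpone) * period, t])
        # An instance still unfinished at its deadline has missed it.
        for i, queue in enumerate(queues):
            if queue and queue[0][1] <= t:
                return i, queue[0][2], queue[0][1]
        # Run the highest priority task that has pending work for one slot.
        for queue in queues:
            if queue:
                queue[0][0] -= 1
                if queue[0][0] == 0:
                    queue.pop(0)
                break
    return None

# Periods 5 and 15, computation times 1 and 13 (utilizations 1/5 and 13/15).
print(first_missed_deadline([(5, 1), (15, 13)], postpone=1))
```

Without postponement (postpone=0) the same task set misses its first deadline at time 15, within
the first hyperperiod.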

Given  that  no  deadline  has  been  missed,  a  set  of  sufficient  conditions  for  stopping  the 
simulation/emulation  is  the  following  theorem: 

The  Halting  Theorem:  The  deadlines  for  a  given  set  of  periodic  tasks  under  a  given  scheduling 
algorithm  will  always  be  met,  if 

•  No deadline is missed within the first two hyperperiods.

•  The total amount of input DMA (computation, output DMA) carrying over from the second
hyperperiod into the third hyperperiod is less than or equal to that from the first
hyperperiod into the second hyperperiod.

•  The first idle time slot for input DMA (computation, output DMA) in the third hyperperiod
is,  if  it  exists  at  all,  no  later  than  that  in  the  second  hyperperiod. 

We outline the reasoning which makes these sufficient conditions. The first condition is obvious.
The second condition states that work must not keep piling up from one hyperperiod to the next
hyperperiod for either input DMA, computation or output DMA. The third condition requires some
explanation. Starting from the beginning of each hyperperiod, the arrival of tasks is identical owing to
the periodicity of tasks. However, the carryover from one hyperperiod to the next may not have the
same effect even if the total units of work is the same for each of the three phases. This is because
tasks can be of different priority. However, all the carryover from a previous hyperperiod can be
treated as initial conditions for a given hyperperiod. Beginning with the first idle slot of each
scheduling phase, the effect of the carryovers at that phase vanishes. For example, suppose that the first idle
slots  for  input  DMA,  computation  and  output  DMA  exist  and  are  17,  21,  57  for  both  the  second  and 
third hyperperiods. In addition, the total amounts of carryover for the three phases are 2, 7, 11
for both the second and third hyperperiods. In this case, we know that all the deadlines will always be
met as long as there is no deadline missed within the first three hyperperiods. The reasons are
twofold. First, the carryover does not increase from one hyperperiod to the next hyperperiod.
Second,  the  impact  of  carryovers  is  not  made  worse  from  one  hyperperiod  to  the  next  hyperperiod. 
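The stopping test can be expressed as a small predicate over per-stage statistics collected during
the run. The sketch below is one reading of the conditions, not code from the simulator. The
record layout and names are hypothetical. It takes the carryover condition in the sense of the
explanation above (carryover must not increase), and it declines to stop in the one ambiguous
case, where the third hyperperiod has an idle slot but the second has none.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class StageStats:
    carry_into_second: int             # work carried from hyperperiod 1 into 2
    carry_into_third: int              # work carried from hyperperiod 2 into 3
    first_idle_second: Optional[int]   # first idle slot in hyperperiod 2, if any
    first_idle_third: Optional[int]    # first idle slot in hyperperiod 3, if any

def may_halt(deadline_missed: bool, stages: Sequence[StageStats]) -> bool:
    """True when the sufficient conditions hold for every stage."""
    if deadline_missed:
        return False
    for s in stages:
        # Carryover must not grow from one hyperperiod to the next.
        if s.carry_into_third > s.carry_into_second:
            return False
        # The first idle slot, if it exists, must not come later.
        if s.first_idle_third is not None:
            if s.first_idle_second is None:
                return False   # conservative choice for the ambiguous case
            if s.first_idle_third > s.first_idle_second:
                return False
    return True

# Values from the example: first idle slots 17, 21, 57 and carryovers 2, 7, 11
# for input DMA, computation and output DMA in both hyperperiods.
example = [StageStats(2, 2, 17, 17), StageStats(7, 7, 21, 21), StageStats(11, 11, 57, 57)]
print(may_halt(False, example))
```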

Given that a criterion for halting the simulation or emulation has been created, one can now
consider a pilot study to provide a preliminary evaluation of the various scheduling approaches. This pilot
study  was  fully  described  in  an  earlier  section. 




3.  The  Simulator 

A  simulator  was  built  to  allow  for  the  convenient  study  of  a  large  number  of  cases.  This  tool  was 
specially  designed  to  take  a  basic  task  set  and  continue  to  increase  the  utilization  until  a  breaking 
point  was  reached.  One  begins  with  an  initial  task  set  defined  in  terms  of  number  of  tasks,  period  for 
each,  relative  input  DMA,  CPU,  and  output  DMA  utilizations.  The  program  systematically  searched  for 
that task set which was schedulable, but for which any increase in utilization led to an unschedulable
situation. The simulator allowed for a wide variety of combinations of scheduling algorithms and for
the  postponement  of  deadlines.  It  also  had  a  useful  plotting  and  analysis  facility  to  aid  the  user's 
understanding  of  the  complex  scheduling  problem. 
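The threshold search can be sketched as a loop that scales the relative stage times by an integer
factor until a schedulability test first fails. The code below is not the simulator: the names are
hypothetical, and the default test only checks that no stage is loaded beyond 1.0, which is a
necessary condition standing in for an actual simulation run of the chosen policies.

```python
from fractions import Fraction

def stage_loads_ok(periods, weights, k):
    """Stand-in test: at scale k, no stage has utilization above 1."""
    for stage in range(3):   # 0: input DMA, 1: CPU, 2: output DMA
        load = sum(Fraction(k * w[stage], p) for p, w in zip(periods, weights))
        if load > 1:
            return False
    return True

def breakdown_threshold(periods, weights, is_schedulable=stage_loads_ok,
                        max_scale=10_000):
    """Return (last schedulable scale, first unschedulable scale).

    weights: per task, relative (input DMA, CPU, output DMA) times.
    The second element is None if no failure occurs up to max_scale."""
    last_ok = 0
    for k in range(1, max_scale + 1):
        if not is_schedulable(periods, weights, k):
            return last_ok, k
        last_ok = k
    return last_ok, None

# Two tasks with periods 20 and 40 and equal relative stage times.
print(breakdown_threshold([20, 40], [(1, 1, 1), (1, 1, 1)]))
```

Because the scaled times are integers, the last schedulable load and the first unschedulable load
can be noticeably far apart, as in this example (39/40 versus 42/40).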

3.1  Users’  Guide 

The  simulator  was  written  in  standard  Pascal.  To  run  the  program,  a  user  only  needs  to  answer  the 
prompts.  There  are  3  sets  of  options  that  a  user  can  select.  The  first  set  concerns  the  selection  of 
scheduling policies. If a user chooses the auto-mode, then the simulator will run each task set with 5
sets of policies: the 3 stage FIFO, the 3 stage Rate Monotonic, the 3 stage Deadline, the 3 stage
Propagated Deadline and finally the set of algorithms consisting of FIFO for input DMA scheduling,
Rate Monotonic for CPU scheduling and FIFO for output DMA scheduling. The last set of algorithms
is used to model the effect of having a good processor scheduler but no good I/O scheduler. Instead
of using the auto-mode, a user can choose the manual mode to specify scheduling algorithms for
each stage of scheduling. There are four algorithms to choose from: FIFO, Rate Monotonic, Deadline
and Propagated Deadline. Finally, in either auto-mode or manual mode, users will be asked if they
want  to  postpone  the  deadline. 

Having  specified  the  scheduling  algorithms,  one  needs  to  specify  a  set  of  tasks.  The  specification 
has two modes: the statistical mode or the manual mode. If the statistical mode is chosen, a user will
be asked to specify the number of tasks, the average ratio between input DMA, computation and
output DMA and the average phase delay of tasks' starting times. Once these parameters are given,
the task set will be generated randomly. If the manual mode is chosen, then a user must specify each
task  in  turn  by  answering  all  the  prompts  for  each  task. 

After specifying both the scheduling algorithms and tasks, a user will be asked to select the options
that control the execution and style of reporting. First, a user will be asked if he wishes to automatically
search for the breakdown threshold for the given task set pattern for each of the scheduling policies.
If the answer is "yes", the threshold will be reported. Otherwise, the task set will be executed and the
results of execution will be reported. The reporting style can be either verbose or terse. In the
verbose mode, each task is described in detail, and a user has the option to plot the scheduling
process. In the non-verbose mode, only a summary of the execution results will be given. Finally, the
results of execution will be recorded in an output file called Result.sch. There is another output file
called EMU., which is used to drive the emulator to obtain a cross validation. For the use of the file
EMU., please refer to the chapter on the emulator. Both of the files will be rewritten when the
simulator program is executed again. Therefore, the results should be copied to another file before
re-executing the simulation program. In the following, we give two examples to illustrate the use of the
simulator. 

3.2  Examples 

In  this  section,  we  give  two  examples  to  illustrate  the  use  of  this  simulator.  Example  1  uses  the 
manual  modes  and  plots  the  scheduling  process.  Note  that  in  the  plotting,  the  word  "end"  marks  the 
end  of  a  hyperperiod.  Example  2  uses  the  auto-modes  to  have  a  comparative  study  of  different 
scheduling  algorithms.  Note  that  for  each  set  of  scheduling  algorithms,  the  loads  at  which  all  the 
deadlines  are  just  met  and  the  loads  at  which  deadlines  are  just  missed  are  reported.  Because  of  the 
integer  nature  of  the  periods  and  computation  times,  these  two  values  are  sometimes  not  very  close. 


3.2.1  Example 1: The Manual Approach
[LNKXCT  SCHEDU  Execution] 

The  results  of  this  execution  are  recorded  in  the  file  result.sch. 

To  save  the  record,  copy  it  before  executing  this  program  again. 

Scheduling  policy  can  be  specified  manually  for  each  stage. 

In auto-mode, the program steps through
1:FIFO, 2:Rate Monotonic, 3:Deadline,
4:Propagated deadline, 5:FIFO I/O and Rate Monotonic CPU. Each policy set
can be run with 0 and 1 postponed deadline.

The style of execution can be either verbose or not.

policy mode: auto, manual (a/m)m

Scheduling Policy Selection Guide
1: FIFO, 2: Rate Monotonic, 3: Deadline, 4: Propagated Deadline

Specify the No. 1 set of scheduling policies
DMA-in scheduling policy = (1..4)1
CPU scheduling policy = (1..4)1
DMA-out scheduling policy = (1..4)1
Deadline Postpone = (0..1)0
Specify a new set of policies (y/n)n

generate the task set randomly (y/n)n


385 


input  task  1  parameters 
period  =(  3<  P  <  700)20 

the  hyperperiod  now  is 

DMAin  time  =5 
Total  DMA-in  Load  now  is 

Computation  time  =5 
Total  Computation  Load  Now  is  0.2500 

DMAout  time  =5 

Total  DMA-out  load  now  is  0.2500 

The  initial  starting  time  of  the  task=(>=0)0 
specify  next  task  (y/n)y 

input  task  2  parameters 
period  =(  3<  P  <  700)40 

the  hyperperiod  now  is  40 

DMAin  time  =8 

Total  DMA-in  Load  now  is  0.4500 
Computation  time  =8 

Total  Computation  Load  Now  is  0.4500 
DMAout  time  =8 

Total  DMA-out  load  now  is  0.4500 

The  initial  starting  time  of  the  task=(>=0)0 
specify  next  task  (y/n)n 

Automatically  search  for  the  scheduling  thresholds  (y/n)n 
Verbose  Mode  (y/n)y 


Task 

Peri od 

DMAin 

CPU 

DMAout 

Phase 

1 

20 

5 

5 

5 

0 

2 

40 

8 

8 

3 

0 

2  tasks 

0.4500 

0.4500 

0.4500 

OZ 

FIFO 

FIFO 

FIFO 

Dead  1  i no  postpone 

All  (leadlines 

are  mt;  t 

within 

2  hyperpet’  i  ods  . 

Mo  Carryovers. 

plot  thi;  timelines  (y/n)y 

(1:  FrO'i  beginning,  2;  Around  the  end  of  Myperperiod,  3:  both)! 
(late  i;.<  .cut  ion  indicated  by  negative  Number) 

1111122222222 
1  1  1  11  2  2  2  2  2  2 
11111 


20 

0.2500 


01  lA  in 
evil 

I;.’!;', out 


386 


0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19  20 

DMA  in: 

1 

1 

1 

1 

1 

CPU 

2 

1 

1 

1 

1 

1 

OMAout : 

2 

2 

2 

2 

2 

2 

2 

2 

1 

1 

1 

1 

1 

0 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39end 

DMA  in: 

1 

1 

1 

1 

1 

2 

2 

2 

2 

2 

2 

2 

2 

CPU  : 

1 

1 

1 

1 

1 

2 

2 

2 

2 

2 

2  2 

DMAout : 

1 

1 

1 

1 

1 

0 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 

51 

52 

53 

54 

55 

56 

57 

58 

59  60 

DMA  in: 

1 

1 

1 

1 

1 

CPU 

2 

1 

1 

1 

1 

1 

DMAout : 

2 

2 

2 

2 

2 

2 

2 

2 

1 

1 

1 

1 

1 

0 

61 

62 

63 

64 

65 

66 

67 

68 

69 

70 

71 

72 

73 

74 

75 

76 

77 

78 

79end 

DMA  in; 

1 

i. 

1 

■1 

▲ 

1 

2 

2 

2 

2 

2 

2 

2 

2 

CPU  : 

1 

1 

1 

1 

1 

2 

2 

2 

2 

2 

2  2 

DMAout : 

1 

1 

1 

1 

1 

0 

81 

82 

83 

84 

85 

86 

87 

88 

89 

90 

91 

92 

93 

94 

95 

96 

97 

98 

99100 

Continue  the  plot  (y/n)n 


continue  =  (y/n)n 
EXIT 


3.2.2  Example  2:  The  Automode  Approach 
[LNKXCT SCHEDU Execution]

The results of this execution are recorded in the file result.sch.
To save the record, copy it before executing this program again.

Scheduling policy can be specified manually for each stage.
In automode, the program steps through 1:FIFO, 2:Rate Monotonic, 3:Deadline,
4:Propagated deadline, 5:FIFO I/O and Rate Monotonic CPU. Each policy set
can be run with 0 and 1 postponed deadline.

The style of execution can be either verbose or not.

policy mode: auto, manual (a/m)a
Run postponed deadline (y/n)n
generate the task set randomly (y/n)y
no of tasks (<= 30) 10
average ratio for DMAin, Computation and DMAout (e.g. 2 4 1) 7 5 7
phase delays: auto (Mean=0 and 50% of period), manual (a/m)m
specify average phase delay as X% of period, X= 0
Automatically search for the scheduling thresholds (y/n)y
Verbose Mode (y/n)n

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.4462  0.3039  0.3928  0%
FIFO  FIFO  FIFO  Deadline postpone = 0
All deadlines are met within 2 hyperperiods. No Carryovers.

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.4896  0.3039  0.3984  0%
FIFO  FIFO  FIFO  Deadline postpone = 0
Task: 6, Instance: 1, first missed deadline at time 37

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.8904  0.6079  0.7857  0%
Rate Monotonic  Rate Monotonic  Rate Monotonic  Deadline postpone = 0
All deadlines are met within 2 hyperperiods. No Carryovers.

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.9349  0.6079  0.7912  0%
Rate Monotonic  Rate Monotonic  Rate Monotonic  Deadline postpone = 0
Task: 6, Instance: 1, first missed deadline at time 37

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.8904  0.6079  0.7857  0%
Dead Line  Dead Line  Dead Line  Deadline postpone = 0
All deadlines are met within 2 hyperperiods. No Carryovers.

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.9349  0.6079  0.7912  0%
Dead Line  Dead Line  Dead Line  Deadline postpone = 0
Task: 6, Instance: 1, first missed deadline at time 37

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.9484  0.6484  0.8293  0%
PPG Dead Line  PPG Dead Line  PPG Dead Line  Deadline postpone = 0
All deadlines are met within 2 hyperperiods. No Carryovers.

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.9984  0.6706  0.8793  0%
PPG Dead Line  PPG Dead Line  PPG Dead Line  Deadline postpone = 0
Task: 1, Instance: 18, first missed deadline at time 1261

Task  Period  DMAin   CPU     DMAout  Phase
10 tasks      0.4984  0.3039  0.4341  0%
FIFO  Rate Monotonic  FIFO  Deadline postpone = 0
All deadlines are met within 2 hyperperiods. No Carryovers.

Task  Period  DMAin   CPU     DMAout  Phase
FIFO  Rate Monotonic  FIFO  Deadline postpone = 0
Task: 6, Instance: 1, first missed deadline at time 37

continue = (y/n)n
EXIT




4.  The  Emulator 

The  scheduling  emulator  is  a  computer  system  designed  to  serve  as  a  testbed  for  experimentally 
investigating  the  applicability  of  the  scheduling  theory  we  have  developed  and  will  develop  in  the 
course  of  this  ongoing  work.  In  the  arena  of  computer  systems,  theoretical  results  often  are  not 
considered  relevant  until  their  applicability  has  been  demonstrated  in  a  real,  working  system.  We 
have  reached  the  stage  in  this  investigation  that  such  a  demonstration  of  the  theoretical  principles 
heretofore  derived  is  called  for.  It  is  entirely  fair  to  expect  experimental  validation  of  these  results, 
because  a  real  system  will  usually  be  subject  to  constraints  that  violate  certain  assumptions  of  the 
theoretical  models  and  the  extent  of  these  effects  must  be  determined. 

Our major requirements for the emulator are that it:

•  support  a  realistic  model  of  scheduling  that  corresponds  as  closely  as  possible  to  the 
model  specified  in  Chapter  2; 

•  support the scheduling algorithms we plan to study and facilitate the addition of any other
scheduling algorithms that may become interesting to us;

•  be  modular  and  easy  to  modify  along  other  dimensions  as  well; 

•  allow  observation  of  its  internal  behavior  at  a  fine  granularity,  and  allow  highly  precise 
timing  measurements  to  be  performed; 

•  be  relatively  easy  to  use; 

•  require  a  reasonable  amount  of  effort  to  build  in  the  time  available  (since  no  existing 
system  that  we  know  of  satisfies  the  foregoing  requirements). 

The  emulator  meets  all  its  goals  satisfactorily. 

It is interesting to contrast the roles of the simulator and emulator in our research. The theoretical
model of scheduling described in Chapter 2 is already sufficiently complicated that experimentation is
needed to understand its behavior. Working out examples by hand becomes cumbersome and
time-consuming, so an automated tool is very useful, and the simulator serves this need. The simulator is
better for this kind of work, because the emulator takes much longer to process a single test, is less
oriented toward interactive work, and produces a large amount of data for each test.

For the same reason, the simulator is better able to perform the long searches required to find the
scheduling breakdown points. However, the emulator can provide valuable details about what happens
in a real system when the tests that are near the scheduling breakdown point are run. Also,
results from the emulator help to validate the simulation results (and vice versa). In short, the roles of
the simulator and emulator are complementary, and both are practically necessary for this kind of
investigation. The following sections provide a more detailed description of the emulator.




4.1  Overview 

This  section  provides  an  overview  of  the  emulator's  structure  and  how  it  operates.  Details  of  its 
implementation  are  discussed  in  Section  4.2;  details  of  its  operation  are  given  in  Section  4.3. 

The requirements for the emulator listed above led to its design in a straightforward way. The
scheduling model states that the actual processing of input DMA, task execution, and output DMA
goes on without interference among the phases. This implies that they can be performed sequentially
for purposes of emulation. The results of the input DMA scheduling determine the task initiation times
for the processing phase, and the completion times of tasks determined by the processing phase
determine the initiation times for output DMA. The results of output DMA scheduling reveal whether or
not deadlines have been met. Therefore, the emulator consists of three sequential phases: input
DMA, processing, and output DMA.


In this year's work, the processor is the focus of realism in the emulator. Once processor
scheduling under realistic conditions is better understood, attention can be shifted to the details of I/O
devices, applying the lessons learned from applications processor scheduling.


So,  very  generally,  the  emulation  proceeds  in  three  steps: 

pre-processing   From a description of the task set, appropriate input data for the
                 emulator is generated, and the input DMA scheduling is simulated to
                 provide a list of task initiation times for the emulation.

execution        The emulator is executed, using the list of task initiations produced in
                 the preceding step to drive the emulation. This produces a record of
                 the task completion times for each task initiation.

post-processing  The task completion times are input to a simulation of the output DMA
                 scheduling, and the final output completion times are compared to the
                 task deadlines for each task initiation, to determine whether or not
                 deadlines were met.
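The chaining of the three phases can be sketched in a few lines. This is only an illustration under stated assumptions: every phase is modeled as a simulated non-preemptive FIFO server (in the emulator the processing phase is actually executed, not simulated), the deadline of an initiation is assumed to be one period after its release, and the names fifo_stage and three_phase are hypothetical.

```python
def fifo_stage(jobs):
    """Serve jobs one at a time in order of ready time (FIFO, non-preemptive).

    jobs: list of (ready_time, service_time, job_id).
    Returns {job_id: completion_time}.
    """
    clock = 0
    done = {}
    for ready, service, job_id in sorted(jobs, key=lambda j: (j[0], j[2])):
        start = max(clock, ready)      # wait for the resource and for the job
        clock = start + service
        done[job_id] = clock
    return done

def three_phase(tasks, horizon):
    """tasks: {task_id: (period, dma_in, cpu, dma_out, phase)}."""
    # Release time of every task initiation before the horizon.
    release = {}
    for tid, (period, _, _, _, phase) in tasks.items():
        for k, t in enumerate(range(phase, horizon, period)):
            release[(tid, k)] = t

    ready = release
    for column in (1, 2, 3):           # input DMA, processing, output DMA
        jobs = [(ready[j], tasks[j[0]][column], j) for j in ready]
        ready = fifo_stage(jobs)       # completions feed the next phase

    # Assumed deadline: one period after release.
    missed = sorted(j for j in release
                    if ready[j] > release[j] + tasks[j[0]][0])
    return ready, missed

tasks = {1: (20, 5, 5, 5, 0), 2: (40, 8, 8, 8, 0)}   # task set of Example 1
finish, missed = three_phase(tasks, horizon=40)
print(finish)   # {(1, 0): 15, (2, 0): 29, (1, 1): 35}
print(missed)   # []
```

Because each phase only consumes the completion times of the one before it, the phases can be run one after another rather than concurrently.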


The  task  descriptions  include  the  relevant  task  parameters  described  in  Chapter  2: 

1.  period

2.  time required for input DMA

3.  time required for task processing

4.  time required for output DMA

5.  phase (time of first initiation)

6.  time required for input interrupt service


The pre- and post-processing phases involve using various software tools developed for the
emulator, which run under the Unix operating system. These programs
are described in some detail in the sections on operation and internals. The execution phase is
more interesting, since it must emulate a real, stand-alone system.

The emulator execution phase uses a CPU that runs an operating system kernel, which manages
the execution of applications tasks. So far, this sounds like any other typical computer system. It is
called an emulator because:

1.  The  "applications"  tasks  do  not  accomplish  any  real  application;  they  just  use  up  the 
amount  of  CPU  time  specified  for  them  in  the  task  descriptions. 

2.  I/O  is  handled  differently  than  in  a  real  system. 

In  real  systems,  and  in  the  emulator,  an  input  interrupt  signals  the  completion  of  an  input  DMA 
operation  to  the  processor,  and  results  in  an  application  task  being  dispatched  in  response  to  the 
input.  However,  there  are  not  any  real  I/O  devices  in  the  emulator  (since  no  real  I/O  is  going  on),  and 
tasks  are  only  dispatched  in  response  to  input  interrupts.  Therefore,  there  must  be  something  to 
cause  the  input  interrupts  which  drive  the  system.  Also,  we  know  that  the  input  interrupts  must  occur 
according  to  the  schedule  produced  by  the  pre-processing  phase  of  the  emulator.  So  we  need  some 
kind  of  device  that  will  produce  the  interrupts  at  the  specified  times.  This  can't  execute  on  the 
applications  processor  (perhaps  "simulating”  the  occurrence  of  interrupts),  since  it  would  interfere 
with  the  measurements  by  introducing  an  overhead  that  would  not  be  present  in  a  real  system. 

To understand our approach to this problem, consider the hardware configuration shown in Figure
4-1. This is a standard node of the Archons Interim Testbed system, which also supports the Archons
project's Alpha distributed operating system kernel. Each node consists of a Multibus chassis
containing the following:

•  two 68010-based Sun Microsystems Version 1.5 processor boards (with 1M byte of
on-board memory each)

•  one 0.5M byte Multibus memory board

•  one 3Com Ethernet controller (with Ethernet tap)

•  one custom-built interrupt generator board (to provide inter-processor interrupts)

With this in mind, the emulator execution phase is implemented as follows.

One of the processors runs VHS, a simple kernel that provides multi-tasking and scheduling to run
the applications tasks. This kernel is not radically different from most small kernels, except that it is
smaller than most, since any functionality not directly relevant to scheduling was omitted. VHS will be
discussed in more detail in Section 4.2. In the diagram, this processor is designated "VHS host."




The  other  processor,  designated  "IntGen  host,"  runs  a  simple  program  called  IntGen  ("Interrupt 
Generator")  that  causes  the  I/O  interrupts  to  be  generated  at  the  intervals  specified  in  the  list  of  task 
initiation  times  (hereinafter  referred  to  as  the  "interrupt  log")  produced  in  the  pre-processing  step. 
The  interrupt  log  is  downloaded  with  the  IntGen  program,  which  uses  the  Sun  1.5  processor's  timer 
facilities  to  time  the  intervals  between  interrupts.  Since  the  Sun  1.5  processors  can’t  directly  cause 
interrupts  on  the  Multibus,  the  IntGen  host  uses  the  interrupt  generator  board  to  generate  them.  This 
board  is  a  simple  Multibus  device  that  will  cause  an  interrupt  of  a  specified  Interrupt  Priority  Level 
(IPL)  when  a  value  is  written  to  one  of  its  memory-mapped  I/O  registers.  (It  was  custom  designed  and 
built  by  another  member  of  the  Archons  project,  J.  Duane  Northcutt,  in  support  of  this  effort  and  the 
Alpha  kernel  effort.) 

In  short,  the  IntGen  program  runs  on  one  processor  and  drives  the  emulation  by  providing  input 
interrupts  to  the  VHS  processor  at  the  intervals  determined  in  the  pre-processing  phase.  VHS  runs 
and  schedules  the  tasks  specified  in  the  task  description,  using  the  specified  scheduling  algorithm,  in 
response  to  the  input  interrupts  produced  by  IntGen.  This  arrangement  is  illustrated  schematically  in 
Figure  4-2. 
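The way an interrupt log drives the emulation can be sketched as a replay loop. This is an illustrative sketch only, not the IntGen source: the log format (pairs of time and task identifier) and the two callables, which stand in for the timer and for the write to the interrupt generator board, are assumptions made for the example.

```python
def replay(interrupt_log, raise_interrupt, wait):
    """Fire each logged interrupt after waiting out the gap since the previous one.

    interrupt_log: iterable of (time, task_id); times in arbitrary ticks.
    raise_interrupt, wait: callables supplied by the caller (hypothetical
    stand-ins for the interrupt generator register write and for the timer).
    """
    previous = 0
    for when, task_id in sorted(interrupt_log):
        gap = when - previous
        if gap > 0:
            wait(gap)
        raise_interrupt(task_id)   # simultaneous entries are raised back to back
        previous = when

# Dry run with recording stand-ins instead of real hardware.
trace = []
replay([(5, 1), (13, 2), (13, 3), (25, 1)],
       raise_interrupt=lambda tid: trace.append(("irq", tid)),
       wait=lambda ticks: trace.append(("wait", ticks)))
print(trace)
```

In the dry run, the two entries logged for time 13 are raised one after the other with no wait between them, which is the serialization of simultaneous interrupts that a single generating processor imposes.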


[Figure 4-2: Emulator System Organisation. Diagram labels: Emulates I/O Environment;
Runs Scheduling Algorithm; Emulates Application Loads.]


The Ethernet controller is used only for downloading IntGen and VHS, and for uploading the results
of the emulation. The Multibus memory is currently unused, but may be used to hold the interrupt log,
should it ever grow too large to be kept in the processors' local memories.


For DMA scheduling, the emulator currently supports the following algorithms, in preemptive and
non-preemptive versions: FIFO, rate monotonic, deadline, and propagated deadline. For processor
scheduling, the following algorithms are supported in preemptive form only: rate monotonic, deadline,
propagated deadline. Also, the emulator supports postponed deadlines (as described in Chapter 2).




4.2  Emulator  Internals 

This  section  discusses  the  internals  of  the  emulator,  and  serves  as  a  combination  of  design 
rationale,  design  description,  and  implementation  description.  It  provides  fewer  details  than  separate 
documents  would,  and  attempts  to  focus  on  those  details  as  they  relate  to  the  design  requirements 
and  the  nature  of  the  scheduling  problems  we  are  studying.  We  discuss  the  VHS  kernel  first,  then,  in 
less  detail,  the  other  tools  required  for  emulation. 

4.2.1  VHS  Internals 

We  will  approach  VHS  internals  by  first  explaining  how  our  requirements  for  the  emulator  led  to 
certain  design  goals.  Then  we  give  a  brief,  general  description  of  VHS.  Next  we  go  into  more  detail 
on  the  more  important  parts  of  the  kernel,  and  finally  discuss  sources  of  variation  in  measurements 
relating  to  emulation  results. 

4.2.1.1 Goals and Description

The  requirements  for  the  emulator  (listed  on  page  19)  seem  to  us  to  imply  two  major  design  goals: 

•  observability  of  internal  state 

•  powerful  underlying  mechanisms 

Observability  of  system  state  is  an  important  requirement,  since  we  have  to  be  able  to  find  out 
fine-grained  details  of  what  happens  during  the  execution  of  tests  to  answer  any  questions  that  might 
arise  about  the  behavior  of  the  scheduling  algorithms  and  effects  of  factors  that  arise  in  emulation  but 
not  in  simulation.  However,  it  is  not  required  that  these  details  be  known  in  real  time--in  fact,  since 
task  execution  times  are  measured  in  tens  of  milliseconds,  that  would  be  impractical.  We  chose  to 
keep  a  log  of  significant  events  in  the  kernel  as  they  occur,  and  examine  the  log  after  the  experiment 
is  over  to  find  out  what  happened.  The  mechanisms  we  use  to  do  this  are  discussed  in  subsequent 
sections. 


Powerful underlying mechanisms are important in the VHS design. Each of the major facilities
provided by the VHS kernel is supported by its own particular abstraction, as we will see. The major
facilities are task control, scheduling, and data gathering, and are discussed in more detail below.


4.2.1.2 Details

Some basic decisions about implementation were made at the outset. With the software development
tools and expertise at our disposal, it was clear that the best approach was to write the kernel in
the C language, using as little assembly code as possible for the lowest-level machine-dependent
parts (such as cold start and context switching). The emulator was developed on a Sun
workstation running the Sun 1.3 release of the Berkeley 4.2BSD Unix operating system. Both VHS
and IntGen are standalone programs that run on the Sun 1.5 processor. A listing of the source code
is provided as an appendix to this report, but the following discussion does not make any direct
reference to the code.

The VHS kernel and all user tasks share the same address space, and run in supervisor mode. Tasks
are created statically, only at system-build time, and are never destroyed. Elaborate inter-task
protection schemes and dynamic allocation and deallocation of resources were considered irrelevant to the
purpose of the emulator, and therefore were not implemented. However, nothing in the system
precludes adding any of these features, should they ever become necessary.

Task control is the first major facility we will examine. It is achieved through the use of a standard
low-level synchronization mechanism, the counting semaphore. As we mentioned earlier, each
application task is specified statically at system build time, and is created and made ready to run at run
time when the system is initialized. Each task has the following structure:

begin TaskN
    do forever
        wait on TaskN's semaphore
        use CPU time specified for TaskN execution
    end do
end TaskN

When an input interrupt occurs, the following is done:

begin IOInterruptN
    clear interrupt condition
    compute task deadline
    signal TaskN's semaphore
end IOInterruptN

Each task automatically finishes processing for each initiation before starting processing for the next
(or pending) initiation. In other words, if a task is processing on behalf of input interrupt n and input
interrupt n+1 occurs, the task finishes its processing of interrupt n before starting to process
interrupt n+1. The count in the counting semaphore keeps track of how many initiations are pending for
a particular task (few, we hope, if the tasks are to be schedulable).
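The interaction between the interrupt routine and the task can be illustrated with a counting semaphore from a standard threading library. This is a sketch under assumptions, not the VHS code: the class name EmulatedTask is hypothetical, the deadline is assumed to be the arrival time plus the period, the "do forever" loop is bounded so that the example terminates, and appending the deadline to a list stands in for using up CPU time.

```python
import threading
from collections import deque

class EmulatedTask:
    def __init__(self, period):
        self.period = period
        self.semaphore = threading.Semaphore(0)  # count = pending initiations
        self.deadlines = deque()                 # one entry per pending initiation
        self.completed = []

    def io_interrupt(self, now):
        """Interrupt side: compute the deadline, then signal the semaphore."""
        self.deadlines.append(now + self.period)   # assumed: deadline = arrival + period
        self.semaphore.release()

    def run(self, initiations):
        """Task side: the 'do forever' loop, bounded so the sketch terminates."""
        for _ in range(initiations):
            self.semaphore.acquire()               # wait on the task's semaphore
            deadline = self.deadlines.popleft()
            self.completed.append(deadline)        # stand-in for using CPU time

task = EmulatedTask(period=20)
task.io_interrupt(now=0)        # two interrupts arrive before the task runs,
task.io_interrupt(now=20)       # so the semaphore count reaches 2
worker = threading.Thread(target=task.run, args=(3,))
worker.start()
task.io_interrupt(now=40)       # a third initiation arrives later
worker.join()
print(task.completed)           # [20, 40, 60]
```

The semaphore count absorbs interrupts that arrive while earlier initiations are still pending, and the task works through them strictly in arrival order.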

The semaphore is a simple, standard synchronization mechanism that forms the basis of the task
control system described here. It is sufficiently general that we are confident it will remain useful if
the system is enhanced. Possible enhancements could include consideration of precedence
relationships among tasks or synchronization activities (such as mutual exclusion of access to resources)
among tasks that create dependencies that must be accounted for in scheduling. These could be
controlled using semaphores as well.

The next important facility is scheduling. In VHS, we do not attempt to provide complete
policy/mechanism separation in the scheduler, but provide a mechanism to allow different
scheduling policies, possibly supported by different underlying mechanisms, to coexist in the system and to
be selectable at system startup. This is sufficiently flexible for our purposes, and we consider it a
good compromise between a fully general policy-based scheduler, which is very difficult to build and
would involve unacceptable amounts of overhead, and an inflexible scheduler, which is what is usually
provided by operating systems today, but is clearly useless for our purposes.

Each policy must be implemented by providing a set of routines that correspond to a set of standard
scheduling interface routines recognized by the kernel. In some cases, modifications may have to be
made to other parts of the system to provide the information needed for the scheduling policy to
operate (such as priorities). The set of standard scheduling interface routines is described below.
Note that each task has a task control block (TCB) that contains all the system's knowledge about that
task, and a scheduling control block (SCB) that contains the scheduler's knowledge about that task
(and is part of the TCB). Also, each task has a task identifier (TID) unique to that task.

TaskInit()       Initializes data structures used by the scheduler, and sets up the idle
                 task, which runs when no other task is ready to run.

Init(TCB, SCB)   Initializes the scheduling control block for the task corresponding to
                 TCB with values from SCB.

Ready(TID)       Makes the task whose identifier is TID ready to run (but does not
                 reschedule; when the next reschedule occurs, the change in status
                 takes effect).

Block(TID)       Blocks the current task (but does not reschedule; when the next
                 reschedule occurs, the change in status takes effect).

Resched()        Will reschedule if necessary, depending on the scheduling policy. If it
                 does reschedule, it determines which task should be running. If it is
                 the current task, it does nothing. If it is a different task, it changes the
                 current task state to ready to run (unless it is already blocked), and
                 switches to the new task.

By convention, these routines are named SpnTaskInit, SpnInit, etc., where n is the scheduling policy
number (e.g., Sp1Block, etc.). The entry points corresponding to these routines for each scheduling
policy are entered in a jump table indexed by the scheduling policy number (determined at system
startup). In the code, a macro that indirects through the jump table is used to call these routines, so
the "generic" routine call READY() (for example) accesses the actual routine Sp2Ready if scheduling
policy 2 is being used. Each scheduling policy is a separate code module in the kernel, by
convention. The existing modules schpol1.c and schpol2.c provide examples.
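The jump-table indirection can be illustrated as follows. This sketch is not taken from the kernel source: the bodies of the two policy routines and the ready_lists structure are invented for the example, and a dictionary lookup stands in for the macro that indirects through the table.

```python
ready_lists = {1: [], 2: []}          # one hypothetical ready list per policy

def sp1_ready(tid):                   # policy 1: append in arrival order
    ready_lists[1].append(tid)

def sp2_ready(tid):                   # policy 2: keep sorted by task identifier
    ready_lists[2].append(tid)
    ready_lists[2].sort()

# Jump table indexed by scheduling policy number, one entry per interface routine.
JUMP_TABLE = {
    1: {"ready": sp1_ready},
    2: {"ready": sp2_ready},
}

policy = 2                            # chosen once, at system startup

def READY(tid):
    """Generic call: indirects through the table to the active policy's routine."""
    JUMP_TABLE[policy]["ready"](tid)

for tid in (3, 1, 2):
    READY(tid)
print(ready_lists)   # {1: [], 2: [1, 2, 3]}
```

Callers only ever use the generic name, so adding a policy means adding one module and one table entry without touching the call sites.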


Although this method allows different mechanisms for different policies, the current set of policies
uses the same underlying abstraction: a queue abstraction, implemented by a queue management
package (in the code file queue.c). This package provides the queue manipulation operations
necessary to manage a task ready list and implement a variety of priority functions efficiently. This queue
management package is also used (with very slight modifications that will eventually be propagated
back to the kernel) to provide major support for some of the other tools in the emulator, as we will
mention in a later section. It qualifies as one of the "powerful underlying mechanisms" mentioned
above.
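One way a single queue abstraction can serve several scheduling policies is to parameterize it by a priority function. The sketch below is an assumption about the general idea, not a description of queue.c: the class ReadyQueue, its method names, and the task records are hypothetical, and a binary heap is used as one convenient implementation.

```python
import heapq
from itertools import count

class ReadyQueue:
    """Ready list ordered by a caller-supplied priority function (smaller runs first)."""

    def __init__(self, priority):
        self._priority = priority
        self._heap = []
        self._arrival = count()        # tie-breaker: earlier insertions first

    def insert(self, task):
        heapq.heappush(self._heap, (self._priority(task), next(self._arrival), task))

    def remove_first(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

# Hypothetical task records: (name, period, absolute deadline).
tasks = [("A", 40, 35), ("B", 20, 60), ("C", 30, 50)]

def drain(priority):
    queue = ReadyQueue(priority)
    for task in tasks:
        queue.insert(task)
    return [queue.remove_first()[0] for _ in range(len(queue))]

print(drain(lambda t: 0))      # FIFO:           ['A', 'B', 'C']
print(drain(lambda t: t[1]))   # rate monotonic: ['B', 'C', 'A']
print(drain(lambda t: t[2]))   # deadline:       ['A', 'C', 'B']
```

Only the priority function changes between policies; the insertion and removal operations are shared.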

The third important facility provided by the VHS kernel is data gathering. This is based on the
abstraction of an event record, a record of a significant event that occurred during execution. Event
records are recorded using a macro invocation in the code in places where events of interest occur.
For example, just before the call to the task switch routine, the LOGEVENT macro is invoked to record
the event, in this case a task switch. The event records are stored sequentially in a buffer in memory while the
emulation runs. When the experiment is over, the event records stored in the event buffer constitute a
record of the significant events that occurred during the experiment. They are then examined to
determine exactly what happened. Each event record consists of an event identifier (e.g., Task
Switch), a time stamp (32 bits of clock with granularity of approximately 3.5 microseconds), and a
parameter (for example, the TID of the task being switched to).

Calls to LOGEVENT can be inserted at any point in the code at which an interesting event could
occur. This allows considerable flexibility. For example, if an experiment raised a question about what
is happening in a certain piece of code, a LOGEVENT call could be inserted and the experiment
re-run to determine what is happening there. The event record also provides highly precise timings of
the intervals between events, and so is a valuable performance measurement aid. It is also useful in
debugging, since it provides a trace of the events that occurred proximate to the manifestation of a
bug.
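The shape of an event log of this kind can be sketched as follows. The sketch is illustrative and is not the LOGEVENT implementation: the class EventLog, the policy of dropping events once the buffer is full, and the scripted clock are assumptions. Only the record layout (identifier, 32-bit time stamp, parameter) and the 3.5 microsecond granularity come from the description above.

```python
TICK_US = 3.5                       # clock granularity quoted in the text
WRAP = 1 << 32                      # the time stamp is a 32-bit counter

class EventLog:
    def __init__(self, capacity, clock_ticks):
        self._records = [None] * capacity   # buffer allocated before the run
        self._next = 0
        self._clock_ticks = clock_ticks     # callable returning the current tick count

    def log_event(self, event_id, parameter):
        if self._next < len(self._records):          # assumed: drop events when full
            stamp = self._clock_ticks() % WRAP
            self._records[self._next] = (event_id, stamp, parameter)
            self._next += 1

    def intervals_us(self):
        """Microseconds between consecutive events, allowing for counter wrap."""
        stamps = [r[1] for r in self._records[:self._next]]
        return [((b - a) % WRAP) * TICK_US for a, b in zip(stamps, stamps[1:])]

# Dry run with a scripted clock in place of a hardware timer.
ticks = iter([100, 300, WRAP + 20])
log = EventLog(capacity=8, clock_ticks=lambda: next(ticks))
log.log_event("TASK_SWITCH", 1)
log.log_event("TASK_SWITCH", 2)
log.log_event("TASK_SWITCH", 1)
print(log.intervals_us()[0])        # 700.0
```

Taking differences modulo 2^32 keeps interval measurements correct across a single wrap of the counter, which at 3.5 microseconds per tick happens roughly every four hours.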

Although event logging is not a new idea, this particular facility is the result of the author's
experience with similar facilities used in measuring various aspects of the performance of an
operating system over a period of four years. This facility is also used in the Alpha kernel, and has
already proved valuable for debugging and performance measurement in that context. The tools
developed for interpreting the event logs are quite sophisticated, and add to the value of the logs.
They are described in section 4.2.2.




4.2.1.3 Sources of Variation

These  are  simply  the  factors  in  emulation  that  tend  to  make  the  results  of  an  emulation  run  differ 
from  those  of  a  simulation  for  the  same  inputs.  The  sources  of  variation  we  have  discovered  so  far  fall 
into  three  rough  categories:  timing  phenomena,  uninteresting  overheads,  and  interesting  overheads. 
We  will  briefly  discuss  the  members  of  each  category  in  order. 


Timing  phenomena  include  the  following  effects: 

•  Clock drift between the VHS and IntGen host processors: This is due to the slight
difference in clock rates between different CPUs. It amounts to approximately 0.02% (or
about 1 millisecond difference after five seconds). In other words, it is fairly small, relative
to task execution times measured in tens of milliseconds and experiment run times on the
order of 10 seconds.

•  Variations in IntGen: These are due to variability in the time required to schedule the next
interrupt and cause the current interrupt to occur. These seem to vary on the order of 0.5
milliseconds for each interrupt. This is still a relatively small effect, but we are
investigating means of limiting the variation.

•  Variation in "time-wasting" routines: These routines use up CPU time for the user tasks.
Variations here are extremely small over long periods.
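The clock drift figure can be checked with a line of arithmetic. The helper below is purely illustrative; it assumes the drift accumulates linearly with elapsed time.

```python
def drift_ms(relative_drift, elapsed_s):
    """Accumulated clock difference in milliseconds after elapsed_s seconds."""
    return relative_drift * elapsed_s * 1000.0

print(round(drift_ms(0.0002, 5), 3))    # 1.0  (0.02% over five seconds)
print(round(drift_ms(0.0002, 10), 3))   # 2.0  (over a 10-second experiment)
```

At 2 milliseconds over a full 10-second run, the drift stays well below task execution times of tens of milliseconds.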

Uninteresting  overheads  include  the  following: 

•  Unavoidable  overheads:  The  Sun  1.5  CPU  requires  certain  periodic  interrupts  to  coor¬ 
dinate  essential  housekeeping  actions,  such  as  software  dynamic  memory  refresh  and 
bus  timeouts.  These  effects  seem  to  be  relatively  small,  but  we  are  working  on  charac¬ 
terizing  them  more  exactly. 

• Simultaneity of interrupts: In our emulation, we use one processor to emulate possibly
more than one I/O device. When interrupts are scheduled (as specified in the interrupt
log) to occur simultaneously, they must in fact be dealt with serially by IntGen. This
increases interrupt latency slightly in these cases. However, in the real world it is
doubtful that many interrupts occur precisely simultaneously.

• Event logging overheads: The act of observation affects the observations, since logging
events takes time, which shows up in the experiment results. We have been considering
this part of the "normal" system overhead. However, we plan to tune the event logging
routine to make it faster (it currently takes around 160 microseconds to log an event).
Event logging is not quite as simple as it may sound, since mutual exclusion on the buffer
pool indices is required to prevent "lost" events, or skipped event slots in the log.
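
The mutual exclusion requirement on the event log indices can be illustrated with a minimal sketch. It is not the VHS logging routine: the class name EventLog, the slot layout, and the use of a threading lock (standing in for whatever exclusion mechanism a kernel would use, such as raising the IPL) are all assumptions made for illustration.

```python
import threading


class EventLog:
    """Fixed-size event log whose next-slot index is guarded by a lock.

    Hypothetical illustration only; the lock stands in for the kernel's
    own mutual exclusion mechanism.
    """

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_index = 0
        self.lock = threading.Lock()

    def log(self, event_id, parameter, timestamp_us):
        # Claim a slot atomically so that two loggers can never obtain the
        # same index (a "lost" event) or leave an index unused (a skipped slot).
        with self.lock:
            if self.next_index >= len(self.slots):
                return False  # log is full; the event is not recorded
            index = self.next_index
            self.next_index += 1
        # The claimed index is unique, so the slot can be filled outside the lock.
        self.slots[index] = (event_id, parameter, timestamp_us)
        return True


def demo(num_threads=4, events_per_thread=250):
    """Log events from several threads and return the filled log."""
    log = EventLog(num_threads * events_per_thread)

    def worker(thread_id):
        for i in range(events_per_thread):
            log.log(11, thread_id, i)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Keeping only the index update inside the critical section keeps the exclusion window short, which matters when every logged event adds to the measured overhead.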


Interesting overheads involve effects that are common in real systems and are of legitimate interest
in studying scheduling system behavior. They include:

'king  It  S'.;.  -,0  .1  cnitain  .ai-cai.t  of  timo  t.;  lia.-uiin  m  i'lgat  mt'  .".'ikk.  I,n 

)  !:<•■  i.-':  n  M'e  . .  1  a;er."l-'  nf  tini;;,  i:li:  lil-j  k  si:'/  lO 

:,sal.'r  iiv,'i:',  v*  fnr  ir,'.  rrn-v  .uni  tic:  m-  v  s,,,  nn,;  nii  a. '  i  'd 


'  l.a.oiM'pt 

I;  !IVV 

■  t  .V.IIV  ••  ■ 




• Task scheduling overhead: The very operation of scheduling takes a certain amount of
time, which can vary with the scheduling algorithm and possibly with the number of tasks
being scheduled, and perhaps with other factors as well. This is one of the criteria that
determine how well a scheduling algorithm performs, independent of its theoretical
performance.

• Interrupt latency due to running at high IPL: This is caused by operating system code
running at high interrupt priority level and locking out I/O interrupts during this time. If
an interrupt occurs while the processor is running at high IPL, it is not acknowledged
until the processor returns to a lower IPL. Raising the processor IPL is commonly used in
operating systems (including VHS) as a means of providing mutual exclusion of access to
data structures, and so it is an important effect to quantify. It can also be caused by
interrupt service routines themselves running at high IPL. If interrupt service times at
high IPL are long, this could become more of a problem, and is an effect worth noting.

• Priority inversion: This effect is observed when interrupts are being serviced at high
priority on behalf of a task with low scheduling priority. When this occurs, the scheduling
priority discipline has been violated, which can result in lower performance of the
scheduling algorithm. Instances of this effect have been observed in some experiments already.
It is definitely an interesting effect, and raises the possibility that the way interrupts are
handled in typical processors will have to be modified if effective real-time systems are to
be built.


4.2.2  Other  Tools 

This section discusses some of the other tools used in the emulator system. They comprise a
substantial amount of code and consumed nearly as much development time as VHS itself.

4.2.2.1 IntGen

This is the program that causes the interrupts listed in the interrupt log to occur. It is structurally
simple: a task that causes the current input interrupt to occur, then sets up the Sun 1.5 timer to
provide a timer interrupt when the next input interrupt should occur, and then waits. The tricky part
was tuning the interval timing to take into account the overheads of causing the interrupt and
setting up the next interval. IntGen shares some support code with VHS.
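
The interval tuning described above can be sketched as follows. This is only an illustration of the idea, not IntGen's code: the function name timer_intervals and the single fixed overhead_us estimate are assumptions, and the 500 microsecond figure used in the usage note is a hypothetical value rather than a measured one.

```python
def timer_intervals(interrupt_times_us, overhead_us):
    """Return the interval to program into the timer after each interrupt.

    interrupt_times_us: absolute times (microseconds) at which interrupts
    should occur, in non-decreasing order.
    overhead_us: assumed fixed cost of causing one interrupt and re-arming
    the timer (hypothetical parameter).
    """
    intervals = []
    for current, following in zip(interrupt_times_us, interrupt_times_us[1:]):
        gap = following - current
        # Part of the gap has already been spent raising the current interrupt
        # and setting up the timer, so the programmed interval is shortened.
        # Simultaneous or very close interrupts are clamped to zero and are
        # therefore handled serially, one immediately after the other.
        intervals.append(max(gap - overhead_us, 0))
    return intervals
```

For example, with interrupts due at 18000, 36000, 54000, and 60000 microseconds and an assumed overhead of 500 microseconds, the programmed intervals would be 17500, 17500, and 5500 microseconds.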

4.2.2.2 TSim

This is a special-purpose event-driven simulator used to simulate the input and output DMA
scheduling phases during pre- and post-processing phases, respectively. This is a significant piece
of code, and is very modular, permitting easy addition of scheduling policies. It also uses the queue
management package both for the event list and for the scheduling facilities.




4.2.2.3 Process

This program processes and interprets the event log generated by VHS during an experiment. It is a
highly general tool, and its design is based on considerable experience with event logging systems
(see Section 4.2.1.2). Its organizational paradigm is a collection of dynamically allocated instances of
finite state machine recognizers, which can keep track of individual operation sequences occurring
concurrently within the kernel. Although the original design was done with finite state machines in
mind, the system as implemented allows more powerful constructs.

This tool provides a highly flexible method of performing very sophisticated analyses of the event
log data, which we anticipate will become a valuable capability as the emulator becomes more
complex. We also expect it to be valuable in the context of the Alpha kernel. This program also uses the
queue package to schedule the activations of the concurrent state machine instances.
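
The recognizer paradigm can be illustrated with a small sketch. It is not the process program itself: the class and function names are hypothetical, the event ID values are taken from the sample event log shown later in this chapter, and the reading of the event parameter as a task ID is an assumption. One recognizer instance is allocated per in-flight operation sequence, so several sequences can be tracked concurrently.

```python
IO_INTR_ARRIVES = 11        # event IDs as they appear in the sample event log
TASK_SWITCH = 12
COMPUTATION_COMPLETE = 10


class ResponseRecognizer:
    """Two-state machine: allocated on an arrival, finished by the matching completion."""

    def __init__(self, task_id, arrival_us):
        self.task_id = task_id
        self.arrival_us = arrival_us

    def feed(self, event_id, parameter, time_us):
        """Return the arrival-to-completion time if this event ends the sequence."""
        if event_id == COMPUTATION_COMPLETE and parameter == self.task_id:
            return time_us - self.arrival_us
        return None  # any other event leaves the recognizer in its waiting state


def analyze(events):
    """events: iterable of (event_id, parameter, absolute_time_us) tuples."""
    active = {}    # task id -> recognizer instance, allocated dynamically
    results = []   # (task id, response time in microseconds)
    for event_id, parameter, time_us in events:
        if event_id == IO_INTR_ARRIVES and parameter not in active:
            active[parameter] = ResponseRecognizer(parameter, time_us)
            continue
        # Offer the event to every live recognizer; retire those that complete.
        for task_id in list(active):
            response = active[task_id].feed(event_id, parameter, time_us)
            if response is not None:
                results.append((task_id, response))
                del active[task_id]
    return results, sorted(active)
```

Feeding it events 24 through 31 of the sample log would report one completed sequence each for tasks 2 and 1, and leave a recognizer for task 4 still active.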

4.2.2.4 Miscellaneous Tools

Other  tools  were  also  developed: 

• expdef and expconf interpret the experiment specification file to produce data, command,
and code files for configuration and operation of the emulator.

• unbatch breaks a file of multiple experiment specifications into separate individual
experiment specification files and creates directories and command files for later stages of
emulator operation.

• autovhs runs the emulator by interacting with the VHS and IntGen processors via a serial
line interface to control the execution phase of the emulation. With unbatch, this allows
the complete automation of the emulation for an entire set of test runs. This total automation
is accomplished through coordinating the activities of three separate computer systems.

• time is a set of several programs derived from the VHS kernel that are used to measure
clock drift between processors, calibrate time-wasting routines, and investigate other
timing issues of interest.


4.3 Operation and Examples

In this section we illustrate the operation of the emulator by going through one example test run in
detail. This will provide a framework for understanding the more general exposition of the previous
sections. This example includes the pre-processing, execution, and post-processing phases as well
as the initial inputs and final outputs of the run. A diagram of the information flow during emulation is
given in Figure 4-3.


'I  pr(,c.;S!!  !r/ v/srl  iny  i''!,.'!-;,-  MU  :’r':u  I'-.  ;  sf  Con-iM.i'Kl-' nro:  !u(,sd  i) 1  tn-t 


:;l  ’ll 


■  ■'  I  >'  'I:  ("l  ip  '..r;'  li';  Ml  Mild  ij'-;  ,  j  j.rodU' 




run,  a  directory  with  a  file  specifying  the  test  must  be  created.  This  is  usually  done  by  the  unbatch 
program,  which  takes  a  file  consisting  of  multiple  experiment  specifications  produced  by  the 
simulator,  breaks  it  up  into  individual  experiment  specification  files,  and  creates  a  subdirectory  for 
each  test. 


The  subdirectory  for  a  test  contains  the  experiment  specification  file  (called  exp.spec)  for  that  test. 
unbatch  also  produces  a  command  file  that  performs  all  the  tests  by  running  a  program  for  each  test 
that  reads  the  exp.spec  file  and  makes  a  command  file  that  performs  that  particular  test,  and  then 
executing  that  command  file. 


In  this  case,  only  one  experiment  was  to  be  performed,  so  we  do  these  initial  operations  manually. 


Initially,  the  subdirectory  contains  only  the  exp.spec  file: 


policy: 1 1 1
task 1   70  0.0  10.0  18.0   6.0  0.0  0.0
task 2   36  0.0  10.0  18.0   6.0  0.0  0.0
task 3  180  0.0   4.0   6.0  12.0  0.0  0.0
task 4  420  0.0  10.0  12.0   6.0  0.0  0.0
task 5  420  0.0  16.0   6.0  12.0  0.0  0.0
miss: 0


The  policy  line  gives  the  scheduling  policies  for  input  DMA,  task  execution,  and  output  DMA, 
respectively.  Policy  1  is  rate  monotonic,  so  in  this  example,  all  three  stages  are  scheduled  by  rate 
monotonic.  The  task  lines  give  the  characteristics  of  each  task:  the  task  ID  number,  the  period  (all 
times  are  in  milliseconds),  the  phase,  CPU  time,  input  DMA  time,  output  DMA  time,  input  interrupt 
service  time,  and  output  interrupt  service  time  (currently  unused).  The  miss  line  indicates  whether 
this  test  missed  any  deadlines  when  simulated. 
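
As an illustration of the field order just described, the following sketch parses a specification of this form. It is not expdef: the names Task and parse_exp_spec are hypothetical, and the sketch assumes one whitespace-separated record per line with an optional colon after the keyword.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    period_ms: float
    phase_ms: float
    cpu_ms: float
    dma_in_ms: float
    dma_out_ms: float
    intr_in_ms: float
    intr_out_ms: float  # described in the text as currently unused


def parse_exp_spec(text):
    """Return (policies, tasks, missed) from an experiment specification."""
    policies, tasks, missed = [], [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword = fields[0].rstrip(":")
        values = fields[1:]
        if keyword == "policy":
            # input DMA, task execution, and output DMA policies, in that order
            policies = [int(v) for v in values]
        elif keyword == "task":
            tasks.append(Task(int(values[0]), *(float(v) for v in values[1:8])))
        elif keyword == "miss":
            missed = int(values[0])
    return policies, tasks, missed


EXAMPLE_SPEC = """\
policy: 1 1 1
task 1 70 0.0 10.0 18.0 6.0 0.0 0.0
task 2 36 0.0 10.0 18.0 6.0 0.0 0.0
miss: 0
"""
```

Calling parse_exp_spec(EXAMPLE_SPEC) yields three policy numbers, two Task records, and a miss value of 0.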


Then we execute the following commands manually (note that "t3pap>" is the Unix prompt; in this
case, the last component of the current directory path name):

t3pap> expdef
computation factor = 1.000000
DMA factor = 1.000000
interrupt service factor = 1.000000
t3pap> chmod +x phase1
t3pap> phase1

expdef is a program that reads the experiment specification file and produces a version formatted
for input by other programs:

1  70000 0 10000 18000  6000 0 0
2  36000 0 10000 18000  6000 0 0
3 180000 0  4000  6000 12000 0 0
4 420000 0 10000 12000  6000 0 0
5 420000 0 16000  6000 12000 0 0




(Most of the programs from this point on deal with times expressed in microseconds.) expdef also
provides an ability to multiply the task processing times by constant factors; the default value for each
of these factors is 1. expdef also produces a command file (phase1) that runs a generic
emulation command file with the correct parameters (the scheduling algorithms). We make that file
executable, and execute it. The results follow:

+  p3  1  1  1 

+  expconf  -d  /temp 

p3 is the generic emulation command file. (The initial plus signs indicate commands executed by a
command file.) p3's first action is to run expconf, a program that produces a configuration file for the
experiment, and places it in the specified directory (where it can be conveniently accessed when
building the VHS and IntGen systems). This file is essentially a set of C language data declarations
equivalent to the experiment specification information, but in a different format.
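
The unit conversion and constant factors applied by expdef, described earlier, can be sketched as follows. The function names and the tuple layout are assumptions for illustration; only the three factor names (computation, DMA, interrupt service) come from the expdef output shown above. Whether expdef leaves the period and phase unscaled is also an assumption.

```python
def to_microseconds(time_ms, factor=1.0):
    """Scale a time in milliseconds by a constant factor and convert to whole microseconds."""
    return round(time_ms * factor * 1000)


def scale_task(task, computation_factor=1.0, dma_factor=1.0, interrupt_factor=1.0):
    """task: (id, period, phase, cpu, dma_in, dma_out, intr_in, intr_out), times in ms.

    Returns the same record with times in microseconds. Period and phase are
    converted but not scaled (an assumption of this sketch).
    """
    task_id, period, phase, cpu, dma_in, dma_out, intr_in, intr_out = task
    return (
        task_id,
        to_microseconds(period),
        to_microseconds(phase),
        to_microseconds(cpu, computation_factor),
        to_microseconds(dma_in, dma_factor),
        to_microseconds(dma_out, dma_factor),
        to_microseconds(intr_in, interrupt_factor),
        to_microseconds(intr_out, interrupt_factor),
    )
```

With all factors left at 1, task 1 of the example (10 ms of computation, 18 ms of input DMA, 6 ms of output DMA) converts to 10000, 18000, and 6000 microseconds, which agrees with the values VHS prints for task 1 when it starts.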


Now  the  input  DMA  phase  begins: 

+ tsim -t 0 -p 1 -g 3

start  log  at  0,  input  dma  simulation,  rate  monotonic  priority 
The  hyper-period  is  1260000  us. 

0  missed  deadlines. 

tsim is the simulator used to simulate input and output DMA scheduling. In this case, it is run with
options that specify input DMA scheduling, the rate monotonic algorithm, and the production of 3
hyperperiods of simulation data. tsim responds by printing out relevant information. The result of the
simulation is a file that specifies the completion times of the input DMA operations (which are the
initiation times of the tasks) that will be input to the emulator. A sample of its format follows:


 18000  18000  2   18000
 36000  18000  1   34000
 54000  18000  2   18000
 60000   6000  3  120000
 90000  30000  2   18000
106000  16000  1   34000
108000   2000  4  312000

This gives the absolute time of task initiation, the time relative to the last initiation, the task to start,
and the time left before the deadline for that task initiation. Another file containing roughly the same
information in a different format is produced to provide input to the next command.
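
The way such completion times arise under rate monotonic scheduling can be illustrated with a small time-stepped sketch. It is not tsim, which is event-driven. The sketch assumes a single preemptible DMA resource, zero phases, deadlines at the end of each period, priority ties broken by task ID, and that every job finishes before its next release. The input DMA times used are those of the example experiment specification.

```python
def simulate_rate_monotonic(tasks, horizon, step=1000):
    """Preemptive fixed-priority simulation of one resource.

    tasks: list of (task_id, period, cost); all times in microseconds and
    multiples of step. Returns a list of (completion time, task_id,
    time left before the deadline).
    """
    by_priority = sorted(tasks, key=lambda t: (t[1], t[0]))  # shorter period first
    jobs = {}          # task_id -> [work left, absolute deadline]
    completions = []
    for now in range(0, horizon, step):
        for task_id, period, cost in tasks:
            if now % period == 0:
                jobs[task_id] = [cost, now + period]  # release a new job
        # Run the highest-priority job that still has work for one step.
        for task_id, period, cost in by_priority:
            job = jobs.get(task_id)
            if job and job[0] > 0:
                job[0] -= step
                if job[0] == 0:
                    finish = now + step
                    completions.append((finish, task_id, job[1] - finish))
                break
    return completions


# (task ID, period, input DMA time) in microseconds, from the example
EXAMPLE_INPUT_DMA = [
    (1, 70000, 18000),
    (2, 36000, 18000),
    (3, 180000, 6000),
    (4, 420000, 12000),
    (5, 420000, 6000),
]
```

Under these assumptions, simulate_rate_monotonic(EXAMPLE_INPUT_DMA, 110000) gives completions at 18000, 36000, 54000, 60000, 90000, 106000, and 108000 microseconds for tasks 2, 1, 2, 3, 2, 1, and 4 respectively.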


tsim is capable of simulating task scheduling as well as input or output DMA scheduling, so it is
used to do that simulation as well:




+  tsim  -t  1  -p  1 

start  log  at  0,  task  simulation,  rate  monotonic  priority 
The  hyper-period  is  1260000  us. 

0  missed  deadlines. 

+  tsim  -t  2  -p  1 

start log at 0, output dma simulation, rate monotonic priority
The hyper-period is 1260000 us.

0 missed deadlines.

The  first  execution  of  tsim  simulates  task  execution,  and  produces  an  output  file  of  task  completion 
times  that  provides  input  to  the  next  execution,  which  is  an  output  DMA  scheduling  simulation.  These 
three  executions  of  tsim  provide,  in  aggregation,  the  equivalent  of  the  simulator  run  on  this  test,  which 
provides  a  cross-check  of  the  simulator  results,  and  data  for  comparison  with  the  emulator  results. 


The command file proceeds:

+ egrep miss exp.spec
miss: 0
+ mv dmaout.sim dmaout.tsim
+ mv result.sim result.tsim
+ awk -f /usr/ses/vhs/cmd/intrlog.awk Gintrlog.def >
/usr/ses/vhs/temp/Gintrlog.c

egrep is a Unix operating system utility that searches for specified text strings in files. The purpose of
this command is to print out for a human reader of this log the simulator's prediction about whether
deadlines will be missed. Then the intermediate results files for the pure simulation results are
renamed so they won't be overwritten by later executions of tsim based on data from the emulation.


awk is a Unix operating system utility that provides an interpretive programming language based on
pattern matching. This particular awk language program converts the interrupt log produced by tsim
into a file of C language declarations, placed in the same directory with the other generated C
language declarations.

+ pwd
thisdir = /usr/ses/vhs/exp/even/t3pap
+ cd /usr/ses/vhs/temp
+ cc -c Gintrlog.c
+ cc -c Gexpconf.c

These commands remember the name of the current directory and go to the directory with the
generated C language configuration declarations, and compile those declarations to object code (cc
is the C compiler).

+ cd ../intgen

+ make E=/usr/ses/vhs/exp/even/t3pap S=1 intgen

/I'lii/  Id  ^fl  -r  .'.noi)  i.iilry  o  inrijun  .  ,  /  i  ii  t  i|.i|i  /  cn  I  Us  i,  a  c  r. ,  o 

../  1 II n  /  i  M  i  I. .  o  ../iid'j  c/ioinFr.,!  .  .  /  i  n  t  .;p;ri  /  t,  i'lxm  u 
.  .  '  I II  L  ij  ■■  11 1  ■  ()  .  I  .  .  /  i  II i;  I,i'.';f)  1 1)  I.  .  'I  .  .  /  i. f;/fi/)/  c  ;  ^  c  I  <ig  ,  o 
,  '  I  1'  I  I  c 

.  ;  .III  I'D  c  1'  '  ■  ■  r 


I  '  uu  ,/ 1  '  j)  ■ j  n  t. (p  n 




Then the command file changes to the directory with the interrupt generator program and executes a
program-building utility (make) that causes IntGen to be re-linked with the object files just produced,
to configure it for this test. The newly linked IntGen program is then copied to the test directory for
later use. Then, the same is done for VHS:

+ cd ../sys
+ make E=/usr/ses/vhs/exp/even/t3pap S=1 vhs
/bin/ld -N -T 4000 -e entry -o vhs ../sys/coldstart.o
../sys/event.o ../sys/init.o ../sys/iointr.o ../sys/queue.o
../sys/schpol1.o ../sys/schpol2.o ../sys/sem.o ../sys/spp.o
../sys/syscall.o ../sys/task.o ../sys/time.o ../sys/trap.o
../sys/trapint.o ../sys/user.o ../temp/Gexpconf.o
../temp/Gintrlog.o ../lib/libvhs.a -lc
/bin/rm -f core
echo "schPol?W1" > ../temp/schpol.adb
adb -w vhs < ../cmd/adbvhs |
tee /usr/ses/vhs/exp/even/t3pap/vhsadb
compPct at hex address 74d8
dmaPct at hex address 74dc
intrPct at hex address 74e0
_schPol: 1 = 1
cp vhs /usr/ses/vhs/exp/even/t3pap/vhs

The  difference  with  VHS  is  that  a  script  for  the  debugger  (adb)  is  produced  to  patch  the  value  of  the 
scheduling  policy  for  this  test  into  the  re-linked  VHS  program.  Now,  we  start  the  process  of  running 
the  emulator: 

+ cd /usr/ses/vhs/exp/even/t3pap
+ rsh archonsS rm -f /pub/temp/ses/vhs.tftp
/pub/temp/ses/intgen.tftp
+ seg68 -o vhs.tftp -t -d -b vhs
+ rcp vhs.tftp archonsS:/pub/temp/ses/vhs.tftp
+ seg68 -o intgen.tftp -t -d -b intgen
+ rcp intgen.tftp archonsS:/pub/temp/ses/intgen.tftp
+ rm intgen

Back in the test directory, both IntGen and VHS are converted into formats suitable for downloading
to standalone processors by the seg68 program. Then they are copied across the network to a
machine that has an Ethernet connection to the host processors, preparatory to downloading and
execution. At this point, autovhs is run. This program communicates with the host processors via a
serial line interface to execute the commands to the processors' PROM monitors to cause VHS and
IntGen to be downloaded and executed, and to upload the VHS core image. The following portion of the
log has been edited slightly for readability:




+ autovhs
***resetting vhs***
>k 3
Self Test completed successfully.

Sun Workstation Monitor - 0x040000 bytes of memory
Processor: 68010; No Parallel Keyboard;
PROM 1&3: Rev AC; PROM 2&4: Rev AC
>***resetting intgen***
>k 3
Self Test completed successfully.

Sun Workstation Monitor - 0x040000 bytes of memory
Processor: 68010; No Parallel Keyboard;
PROM 1&3: Rev AC; PROM 2&4: Rev AC
>***download vhs***
>J
cmd: r
start: 0x4000
size:
filename: /pub/temp/ses/vhs.tftp
23600 Bytes Received from '/pub/temp/ses/vhs.tftp'
>***download intgen***
>J
cmd: r
start: 0x4000
size:
filename: /pub/temp/ses/intgen.tftp
26780 Bytes Received from '/pub/temp/ses/intgen.tftp'
>***start vhs***
g 4000
VHS version 0.0 on Sun-1
262144 bytes of usable memory:
etext 29408 edata 34780 end 39984
task 1: comp = 10000 dma in = 18000 dma out = 6000 intr = 0
...<lost characters due to lack of
hardware flow control on the lines>...
Ready ***start intgen***
t g 4000
intgen version 1.0 on Sun-1
Init done
Goodbye, cruel world. <from IntGen>
Goodbye, cruel world. <from VHS>
Done ***upload core image***
cmd: s
start:
size:
filename: /tmp/vhs.core
262144 Bytes Sent to '/tmp/vhs.core'


Then the post-processing begins:




+ rcp archonsS:/tmp/vhs.core .
+ slurp
+ process
processed 749 events

The VHS core image uploaded to the archonsS machine is copied back to this machine and a
program (slurp) extracts the event log from the core image. Then process processes the events and
produces a list of task completion times (output DMA initiation times) for input to the output DMA
simulation. It also produces some other outputs, such as files of points for plotting time-lines (which
will be discussed later), and some debugging output, such as a human-readable log of the events.
The following is an example of the format of that log:

<ev #> <ev ID> <ev desc> <ev parameter> <abs event time> <rel.>

24: 11: I/O intr arrives      2: t: 4489346 r: 18854
25: 12: Task switch           2: t: 4490231 r:   885
26: 10: computation complete  2: t: 4500439 r: 10208
27: 12: Task switch           0: t: 4501064 r:   624
28: 11: I/O intr arrives      1: t: 4505335 r:  4270
29: 12: Task switch           1: t: 4506116 r:   781
30: 11: I/O intr arrives      4: t: 4507314 r:  1197
31: 10: computation complete  1: t: 4517158 r:  9843

The post-processing proceeds:

+ tsim -t 2 -p 1
start log at 0, output dma simulation, rate monotonic priority
The hyper-period is 1260000 us.
2  504000  540455 missed deadline by 455
2  576000  612382 missed deadline by 382
2 1764000 1800135 missed deadline by 135
2 1836000 1872113 missed deadline by 113
2 3024000 3060074 missed deadline by 74
2 3096000 3132261 missed deadline by 261
6 missed deadlines.
+ mv dmaout.sim dmaout.em
+ mv result.sim result.em
+ rm vhs.core vhs.tftp intgen.tftp vhs

The output DMA simulation is run, with input from the emulation, and it shows that some deadlines
were missed, but not by much: around half a millisecond at most. Finally, some cleanup of files no
longer needed is performed.
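
The missed-deadline figures in the listing above can be checked with a short sketch. It reads the three columns as task ID, release time, and completion time, and takes the deadline to be the end of the period in which the task was released. Both readings are assumptions inferred from the numbers, not taken from tsim, and the function name deadline_miss is hypothetical.

```python
def deadline_miss(release_us, completion_us, period_us):
    """Return how far past the deadline the completion fell, or 0 if it was on time.

    The deadline is assumed to be release_us + period_us.
    """
    deadline = release_us + period_us
    return max(completion_us - deadline, 0)
```

For task 2, whose period is 36000 microseconds, a release at 504000 and a completion at 540455 gives a miss of 455 microseconds, matching the first reported line.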


Both tsim and process produce files of point values that are specifically designed to be useful in
plotting a time line of the task and DMA executions. The format of such a point file is straightforward:




18  .  2 
36  .  1 
54  ,  2 
60  .  3 
90  .  2 
•  •  • 

Another program, gcmd, takes these files of points and produces a command file for a plotting
program called plot that will produce a time line. One can specify the interval of time (in
milliseconds) to be printed, and how many milliseconds should be plotted for each page of the graph.
An example of the graphical output is shown in Figure 4-4. (This is the first page of the output
produced by gcmd by default for the above run.) Note that time is on the X axis and task ID is on the Y
axis. For each task, input DMA times are plotted slightly above task execution times, which are
slightly above output DMA times. The symbol refers to the initiation of a task, and '*]" marks the
completion of a task (either input DMA, CPU, or output DMA). By convention, the idle task has task ID
0.




plot was developed at Carnegie-Mellon University by Ivor Durham.


[Figure 4-4: Time Line - Page 1. X axis: Time (milliseconds); Y axis: task ID.]


5.  Summary  of  Pilot  Study 

This  year  was  largely  devoted  to  conceptualizing  the  I/O  scheduling  problem  and  to  building  tools 
to  study  it.  Nevertheless,  a  pilot  study  was  undertaken  to  learn  about  I/O  scheduling.  The  results  of 
this  pilot  study  were  quite  clear  cut  and  promising.  We  summarize  the  pilot  study,  the  results  and  the 
implications  that  can  be  drawn. 

5.1  Pilot  Study 

Traditionally, the theory of processor scheduling is based upon a worst case analysis approach.
Such an approach is particularly useful when there exists a scalar measure of performance, and when
the worst case performance bound is sufficiently high from an application point of view. If a worst
case bound is deemed to correspond to poor performance, one may still use the algorithm if the worst
case seems unlikely to arise in practice. A statistical approach is generally more appropriate when
the measure of performance involves many factors or when the average performance is sufficiently
different from the worst case performance. In the case of scheduling periodic tasks with I/O, the
performance of scheduling algorithms is measured by the ability to schedule a task set. The task set
is often measured by three parameters: the input DMA load, the output DMA load and the CPU load. It
is rather easy to create worst case situations corresponding to low utilizations for which deadlines will
not be met. These are especially unrepresentative task sets; thus, a statistical approach is far more
appropriate for this investigation. Having opted for a statistical approach, we still must determine a
method of creating task sets to test the various algorithms. The approach taken was to create task
sets which had some generality.


5.2  Experimental  Design 


A general goal of the pilot study is to create task sets, select certain scheduling algorithms and
determine if the algorithms can meet all the timing requirements of the task set. By doing this with a
large number of task sets (using either the emulator or the simulator) one obtains a picture of the
performance of the scheduling algorithms being studied. Clearly, there is an infinite number of
possible task sets that might be faced. Only a selected subset can be actually studied. We want to
pick as large a subset as possible, and one which will allow us to make useful conclusions. There are
several technical issues which arise. First, since we had a fixed amount of time and money for
simulation, we attempted to devise a method which would allow us to maximize the number of cases


ttiat  coni' I  K  j  c/am.'nc.'l,  Sneonri,  we  )■>•:'!  ‘o  clioo.se  i'  n  whicl'  'vnn-  in 

;h;\i  il;.;  haharj  '.i  ■''rem'^  rcfiUiirs  us  to  r..\:iy  .,i.;  ;i  run  nv-,'  fT'.’ny 


;  SOI ;  'o  ram ' 




should try to keep the hyperperiods as short as possible. If one picks task periods at random, then
extremely long hyperperiods are a likely consequence. For example, if one picked the four periods
247, 258, 123, and 643, then the hyperperiod would be 1,185,527,382. If one considered a task set
with ten tasks, it is likely that the resulting hyperperiods would be too long to be run at all. It is,
therefore, important to design a method which will create a relatively short hyperperiod. We used the
following approach. We began with a set of 7 prime numbers, {2, 2, 2, 3, 3, 5, 7}. All task periods were
drawn by randomly selecting a subset of these elements and multiplying them together to create a
period. For example, if 2, 3, and 7 were picked, then that task would have period 42. Using this
method, we ensure that the hyperperiod is always less than or equal to 2520 (the product of all seven
elements). We always used ten tasks in the task set. There is an important concern about this
approach. We wonder if it can generate a "bad" task set. A bad task set is one with low utilization that
cannot be scheduled. The answer to this is "yes" in the case of pure processor scheduling. In that
restricted case, bad task sets are those in which the relative periods are in the ratio of 1 to 1.40
[Liu 73]. One can easily generate such ratios using these prime numbers. It is still necessary to have
the relative computation times be chosen in a specific way to obtain a bad set; however, we utilize a
uniform random number generator to ensure that these relative computation times can be obtained.
Therefore, this method has the potential to generate both bad and good task sets in the pure
processor scheduling case. In the case of I/O scheduling, it is reasonable to draw the tentative
conclusion that our method will also be able to generate bad task sets in this broader situation. This
follows because it can generate a wide range of period ratios and uses a uniform random number
generator to determine DMA and computation utilization.
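
The period construction can be sketched as follows. The hyperperiod is the least common multiple of the task periods; because every period is a product of elements drawn from the pool, every period divides 2520, and so does the hyperperiod. The function names are hypothetical, and the sketch assumes that the selected subset is non-empty and that its size is chosen uniformly, neither of which is stated in the text.

```python
import math
import random
from functools import reduce

PRIME_POOL = (2, 2, 2, 3, 3, 5, 7)   # product of all seven elements is 2520


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // math.gcd(a, b)


def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lcm, periods)


def random_period(rng):
    """Multiply together a randomly chosen, non-empty subset of the pool.

    The non-empty subset and uniform subset size are assumptions of this sketch.
    """
    size = rng.randint(1, len(PRIME_POOL))
    return math.prod(rng.sample(PRIME_POOL, size))


def random_periods(count=10, seed=None):
    """Draw periods for a task set of the given size."""
    rng = random.Random(seed)
    return [random_period(rng) for _ in range(count)]
```

As a check against the worked example of Section 4.3, the periods 70, 36, 180, and 420 milliseconds give a hyperperiod of 1260 milliseconds, which agrees with the 1260000 microsecond hyper-period that tsim reports.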

Given that we wanted to run as many task sets as possible, the simulation program was constructed
to take an initial task set and systematically scale up the utilizations at each of the three phases until a
task set was created which could no longer be scheduled. This allowed us to determine a
"breakdown point" for each algorithm for each basic task set. This set of breakdown points is
especially valuable, because it offers a direct comparison of the performance of the competing
algorithms under fixed conditions. It was these breakdown points which offered a rather clear and
consistent picture of the behavior of the algorithms.
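
The search for a breakdown point can be sketched as a loop that scales a task set until a schedulability test first fails. This is an illustration only: the function names, the step size, and the tuple layout are assumptions. In particular, bottleneck_ok is a stand-in predicate that merely checks that no stage is loaded beyond 100%; it is a necessary condition only and does not replace running the actual scheduling simulation for a given algorithm.

```python
def stage_utilizations(tasks):
    """tasks: list of (period, dma_in, cpu, dma_out). Returns utilization per stage."""
    return tuple(sum(t[i] / t[0] for t in tasks) for i in (1, 2, 3))


def breakdown_scale(tasks, is_schedulable, step=0.01, limit=100.0):
    """Largest multiple of step by which all three stages can be scaled
    while is_schedulable still accepts the scaled task set."""
    k = 0
    while (k + 1) * step <= limit:
        factor = (k + 1) * step
        scaled = [(p, a * factor, c * factor, o * factor) for p, a, c, o in tasks]
        if not is_schedulable(scaled):
            break
        k += 1
    return k * step


def bottleneck_ok(tasks):
    """Stand-in test (hypothetical): no stage is loaded beyond 100%."""
    return max(stage_utilizations(tasks)) <= 1.0


# (period, input DMA, CPU, output DMA) in milliseconds, from the example of Section 4.3
EXAMPLE = [
    (70, 18.0, 10.0, 6.0),
    (36, 18.0, 10.0, 6.0),
    (180, 6.0, 4.0, 12.0),
    (420, 12.0, 10.0, 6.0),
    (420, 6.0, 16.0, 12.0),
]
```

For the example task set the input DMA stage is the bottleneck at roughly 83% utilization, so with this stand-in test the loop stops at a scale factor of about 1.2. A real scheduling algorithm would generally break down earlier.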


Given we were looking for breakdown points for a set of algorithms over randomly generated task
sets, it became useful to further refine our view of task sets. Clearly, the importance of I/O
scheduling depends in part on the amount of I/O associated with the tasks. If tasks are "compute bound"
and have relatively small amounts of I/O, then I/O scheduling will be relatively unimportant. Similarly,
if tasks are "I/O bound", then I/O scheduling will be of much greater importance. It is, therefore,
useful to incorporate the notion of "I/O bound" and "CPU bound" into the task set generation.




example,  (1,5,1).  This  means  that  we  are  creating  tasks  in  a  task  set  whose  CPU  utilizations  tend  to 
be  5  times  as  large  as  either  their  input  DMA  or  output  DMA  requirements.  It  should  be  noted  that 
tasks  are  created  at  random,  and  it  is  possible  for  one  or  two  of  the  tasks  so  created  to  actually 
require  more  input  and  output  time  than  CPU  time;  however,  on  average  the  tasks  will  require  5  times 
as  much  CPU  time  as  either  input  or  output  time. 


The pilot study consisted of two phases. First, we made runs for 5 cases: (1,5,1), (3,5,3), (1,1,1),
(7,5,7) and (9,5,9). These span the spectrum of "compute bound" to "I/O bound". The pilot study
consisted of running each case assuming no deadline postponement and then again assuming one
deadline postponement. Second, we made a series of runs for the (7,5,7) case to develop a notion of
the average case behavior of the scheduling algorithms used. Finally, all of this was done for 5
scheduling algorithms: FFF, FRF, RRR, DDD, and PPP, where F denotes FIFO, R denotes rate
monotonic, D denotes dynamic deadline and P denotes propagated deadline. The first symbol refers
to input DMA scheduling, the second to CPU scheduling, while the third refers to output DMA
scheduling.

5.3  Results  of  the  Pilot  Study 

5.3.1  The  importance  of  I/O  scheduling 


The  pilot  study  consisted  of  generating  task  sets  at  random  and  subjecting  them  to  the  I/O  and  CPU 
scheduling  algorithms  mentioned  earlier.  A  typical  task  set  consisted  of  ten  tasks  with  randomly 
generated  periods  and  phases.  As  was  discussed  in  the  previous  section,  the  input  DMA,  output  DMA 
and  CPU  computation  times  were  chosen  in  a  way  to  create  a  task  set  characteristic  of  I/O  bound 
tasks, CPU bound tasks and mixtures of the two. The attached figures (Figures 5-1 to 5-5)
summarize a range of cases. As an example, consider Figure 5-1. In this case, tasks were generated
randomly with relative utilizations of one for both input DMA and output DMA and five for the CPU.
Consequently, these tasks tend to be compute bound. Once the periods, utilizations and phases were
fixed, the utilizations were scaled up until a deadline was missed using a particular scheduling
algorithm. The same procedure was carried out using the same scheduling algorithms but with one
deadline postponement. In each of the figures, FFF refers to FIFO being used at all three stages of
scheduling, while FRF uses FIFO for I/O scheduling and rate monotonic for processor scheduling.
RRR, DDD and PPP refer to rate monotonic, dynamic deadline and propagated deadline being used
at all three stages of scheduling respectively. In Figure 5-1, we observe that the FIFO scheduling
algorithm missed a deadline when the bottleneck utilization (the processor in this case) reached
[illegible]. With one deadline postponement, the bottleneck utilization was raised to [illegible]
before  a  deadline  was  missed.  The  more  sophisticated  algorithms  (DDD  and  PPP)  were  able  to  meet 
all  deadlines  when  bottleneck  utilization  was  raised  to  100%. 
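
The scaling procedure described above can be illustrated for a single scheduling stage. The
following is a minimal sketch, not the emulator used in the pilot study. It assumes one preemptive
resource, rate monotonic priorities, deadlines equal to periods and zero phase, and it substitutes
standard response-time analysis for simulation. The function names `rm_schedulable` and
`breakdown_utilization` are hypothetical.

```python
import math


def rm_schedulable(tasks):
    """tasks: list of (period, compute_time) pairs.

    Returns True if every task meets its deadline (taken to equal its period)
    under preemptive rate monotonic priorities with all phases equal to zero.
    """
    ordered = sorted(tasks)  # shorter period means higher priority
    for i, (period, cost) in enumerate(ordered):
        response = cost
        while True:
            # Time demanded by higher priority tasks released in [0, response).
            interference = sum(math.ceil(response / p) * c for p, c in ordered[:i])
            new_response = cost + interference
            if new_response > period:
                return False  # deadline missed
            if new_response == response:
                break  # fixed point reached: worst case response time
            response = new_response
    return True


def breakdown_utilization(tasks, iterations=50):
    """Scale all compute times by a common factor and return (by bisection) the
    largest total utilization at which no deadline is missed."""
    base = sum(c / p for p, c in tasks)
    lo, hi = 0.0, 1.0 / base  # hi corresponds to 100% utilization
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if rm_schedulable([(p, c * mid) for p, c in tasks]):
            lo = mid
        else:
            hi = mid
    return lo * base
```

For two tasks with periods 2 and 3 and equal compute times, this sketch gives a breakdown
utilization of about 0.833, while a harmonic task set (periods 2, 4, 8) can be scaled to
essentially 100%.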

The situation is changed somewhat as the I/O utilization is increased and I/O scheduling becomes
more important. Figure 5-2 considers a case with ten randomly generated tasks drawn from a
distribution whose relative I/O to CPU utilizations are 3 to 5. The CPU is still the bottleneck, but
the FFF and FRF algorithms miss deadlines at 23% and 37% bottleneck utilization. One deadline
postponement raises these to 38% and 70% respectively; however, this is far inferior to the 100%
bottleneck utilization achieved by RRR, DDD and PPP with no deadline postponement! This situation
persists in Figure 5-3. Here, ten tasks are randomly generated from a distribution with equal I/O to
CPU utilization (1,1,1). The FFF and FRF algorithms miss deadlines at 35% bottleneck utilization and
reach 63% and 70% with one deadline postponement. The RRR, DDD and PPP algorithms handle 92%, 98%
and 98% with no deadline postponement. Figure 5-4 presents the case of (7,5,7) where I/O dominates
the CPU. The results are particularly interesting. Again, the sophisticated algorithms dominate FFF
and FRF. Here, we observe the benefits of the PPP algorithm, which handles 93% with no postponement.
Finally, Figure 5-5 shows clearly the vital importance of I/O scheduling. Tasks are generally I/O
bound. Here FFF and FRF were able to handle only 17% bottleneck utilization (input and output DMA)
and only 30% with one deadline postponement. This terrible performance was in marked contrast
with RRR, DDD and PPP, which handled 98% with no deadline postponement.

The figures are typical of the behavior observed. FIFO scheduling for I/O is highly erratic. It is
possible for such an algorithm to achieve a decent breakdown utilization (equivalent to RRR) for
one phasing but to miss deadlines at a very low utilization for another. Clearly I/O must be
scheduled just as the CPU is if its utilization is at all significant. The important observation is
that a simple algorithm such as RRR achieves an excellent breakdown utilization.


To sharpen our understanding of the relative behavior of these algorithms, we performed a second
simulation experiment. This consisted of generating ten sets of ten tasks using a (7,5,7) relative
utilization. The phase was chosen to be zero for all tasks, and no deadline postponement was used.
Each of the ten task sets was scheduled by the five algorithms under consideration. We present the
results in the following table.


TABLE 1

CASE    FFF     FRF     RRR     DDD     PPP

  1     .206    ?       .897    ?       ?
  2     ?       .274    ?       ?       ?
  3     ?       ?       .927    ?       ?
  4     ?       ?       ?       ?       ?
  5     ?       ?       ?       ?       ?
  6     ?       ?       ?       ?       ?
  7     .425    .425    .934    .978    .978
  8     .445    .498    .890    .890    .948
  9     .467    .513    .872    .954    .954
 10     .554    .554    .926    .983    .987

(? marks an entry that is illegible in the source copy.)


[Figure 5-1: 10 Tasks, 1 5 1, phase = 0. Bottleneck utilization with 0 and 1 deadline postponement,
by scheduling algorithm.]

[Figure 5-2: 10 Tasks, 3 5 3, phase = 0. Bottleneck utilization with 0 and 1 deadline postponement,
by scheduling algorithm.]

[Figure 5-3: 10 Tasks, 1 1 1, phase = 0.]

[Figure 5-5: 10 Tasks, 9 5 9, phase = 0. Bottleneck utilization with 0 and 1 deadline postponement,
by scheduling algorithm.]


The table clearly illustrates the drawbacks of using an algorithm such as FIFO and the benefits of
using the rate monotonic algorithm. FIFO has an average performance around 41% with a range of
20% to 55%. This is in marked contrast to the behavior of RRR, which has a rather tight range of
performance, making its behavior fairly predictable. We found its range to be from 85% to 93% with an
average of about 89%. Had more than 10 cases been run, it is likely that the range would have been
wider; however, from our experience the average would be about the same. The 90% figure is a
rough measure of the performance of the RRR algorithm in the (7,5,7) case. The minor differences
between the performance of FFF and FRF simply arise from the fact that these task sets tend to be
I/O bound. Adding a good scheduling algorithm to the CPU has very limited benefits, since the I/O
tends to be the bottleneck. One can observe that FFF and FRF have a wide variance. This is in
keeping with our experience. It is possible for FFF and FRF to do adequately (rarely) and miserably
(more commonly).

The behavior of the more sophisticated algorithms such as DDD and PPP is even better than RRR.
The DDD algorithm has a range of 89% to 98% with an average around 95%. The PPP algorithm is the
best with a range of 93% to 99% with its average around 95%. It is clear that these algorithms
dominate RRR in terms of pure performance, but the gap is quite small. Given the simplicity of RRR
and the relative ease of implementation compared with DDD and PPP, these data offer a compelling
case for the use of RRR.

5.3.2  The  bottleneck  utilization  result 

The I/O scheduling problem contains three scheduling components. As such, it has not been
treated in the literature. Surprisingly, if one computes the utilization at each of the three stages
and if one of these three utilizations dominates the other two, then the schedulability of the task
set is determined by the bottleneck stage. This means that the schedulability of a task set with a
single bottleneck can be determined as if the scheduling were a single stage. This reduces a three
dimensional scheduling problem to a single dimension and allows one to use known results in the
literature for rate monotonic or deadline scheduling.
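
The reduction described above can be sketched in a few lines. This is an illustration under stated
assumptions, not the report's method: the `Task` fields and function names are hypothetical, the
rate monotonic test uses the Liu and Layland utilization bound (a sufficient condition only), and
the deadline test uses the single-resource bound of 100% utilization. The sketch does not check by
how much the bottleneck stage dominates the other two.

```python
from dataclasses import dataclass


@dataclass
class Task:
    period: float
    input_dma: float   # input DMA time needed per period
    cpu: float         # CPU time needed per period
    output_dma: float  # output DMA time needed per period


def stage_utilizations(tasks):
    """Total utilization of each of the three scheduling stages."""
    return {
        "input DMA": sum(t.input_dma / t.period for t in tasks),
        "CPU": sum(t.cpu / t.period for t in tasks),
        "output DMA": sum(t.output_dma / t.period for t in tasks),
    }


def bottleneck(tasks):
    """Return (stage name, utilization) of the most heavily used stage."""
    utils = stage_utilizations(tasks)
    stage = max(utils, key=utils.get)
    return stage, utils[stage]


def liu_layland_bound(n):
    """Sufficient utilization bound for n tasks under rate monotonic scheduling."""
    return n * (2 ** (1.0 / n) - 1)


def passes_single_stage_test(tasks, policy):
    """Apply a single-stage utilization test to the bottleneck stage only."""
    _, utilization = bottleneck(tasks)
    if policy == "rate monotonic":
        return utilization <= liu_layland_bound(len(tasks))
    if policy == "deadline":
        return utilization <= 1.0
    raise ValueError("unknown policy: " + policy)
```

For example, two tasks with (period, input, CPU, output) of (10, 1, 5, 1) and (20, 2, 4, 2) have
stage utilizations of 0.2, 0.7 and 0.2, so the CPU is the bottleneck and only its 0.7 utilization
enters the test.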




5.3.3  Performance  of  Algorithms 

The three stage rate monotonic algorithm (rate monotonic used at each of the three stages) on
average  ensures  that  all  deadlines  will  be  met  when  the  bottleneck  utilization  is  well  above  80%.  With 
one  deadline  postponement,  the  deadlines  are  on  average  met  when  this  utilization  is  well  above  90%. 
The  results  of  the  pilot  study  actually  show  that  these  are  somewhat  conservative  percentages.  It  is 
not  uncommon  to  see  RRR  with  no  deadline  postponement  achieve  over  90%  utilization  and  100% 
utilization  with  one  deadline  postponement. 

The deadline and propagated deadline algorithms improve the average threshold at which deadlines
are missed to well above 90% and to well above 95% when a single deadline postponement is
allowed. Again, these are fairly conservative percentages based on the pilot data. It is typical to
see DDD and PPP meet all deadlines at 95% bottleneck utilization and to handle 100% with one
postponement.

There is one caution that should be mentioned. The pilot study consisted of data sets which tended
to be characterized by small tasks. That is, the largest task utilization divided by the task set
utilization was small, say 20%. This situation tends to provide a favorable scheduling environment,
one in which high utilizations are possible. The sophisticated algorithms tend to offer the most
benefit when the task set is unbalanced. That is, when several tasks have tight timing requirements,
a sophisticated algorithm may be necessary to handle the scheduling, even at moderate utilizations.
When the task set is balanced, slight scheduling errors cause missed deadlines only at relatively
high utilizations.

5.4  Conclusions 

The findings presented in the previous section must be regarded as tentative. In 1986, we intend to
carry out a much more thorough exploration. Furthermore, we intend to investigate other issues
including:

•  Aperiodic response times,

•  Algorithmic behavior under transient overload conditions, and

•  The effects which occur when input DMA, CPU, and output DMA conflict in some way,
say through "cycle-stealing".

In spite of the tentative nature of the study, the conclusions drawn in the previous section have
important implications for hardware and software designs. These include:

•  The rate monotonic algorithm [illegible] the I/O channel processors; its breakdown
utilization is [illegible] high, and it is [illegible] to implement. [illegible]



•  The existing I/O channel processor architecture found in most computers makes the
implementation of I/O scheduling difficult. Such architecture should be modified at
least to support the rate monotonic scheduling algorithm.

•  When scheduling I/O, the DMAs should be made as preemptable as possible, for
example by using a quanta transfer method.

•  Sophisticated algorithms are useful when large tasks must be scheduled in a single task
set (large tasks are single tasks with little slack). Simple algorithms are perfectly
adequate when there are many small processes to be scheduled even if their total utilization
is high. In addition, the emulator clearly shows that when a task set contains some large
tasks, then effects such as the priority inversion created by interrupts can have untoward
effects on scheduling, i.e., interrupts can cause deadlines to be missed.

•  Deadline postponements help in allowing greater utilization with no missed deadlines. They
do, however, entail additional buffering and can cause increased phase delay. Their
greatest value seems to be in smoothing out transient overloads without missing any
deadlines. This point will be explored more fully next year.




References

[Lawler 81]   Lawler, E. L.
              Scheduling Periodically Occurring Tasks on Multiprocessors.
              Information Processing Letters, February, 1981.

[Leung 80]    Leung, J. Y. and Merrill, M. L.
              A Note on Preemptive Scheduling of Periodic, Real Time Tasks.
              Information Processing Letters, Nov, 1980.

[Liu 73]      Liu, C. L. and Layland, J. W.
              Scheduling Algorithms for Multiprogramming in a Hard Real Time Environment.
              JACM, 1973.

[Martel 82]   Martel, C.
              Preemptive Scheduling with Release Times, Deadlines and Due Times.
              JACM, July, 1982.

[Mok 83]      Mok, A. K.
              Fundamental Design Problems of Distributed Systems For The Hard Real Time
              Environment.
              PhD thesis, M.I.T., 1983.


4.8  Modular  Concurrency  Control  and  Failure  Recovery 




Modular  Concurrency  Control  and  Failure  Recovery 
—  Consistency,  Correctness  and  Optimality 


Lui  Sha 

Department of Electrical and Computer Engineering
Carnegie-Mellon University


This work was sponsored in part by: the USAF Rome Air Development Center (RADC) under contract
F30602-81-C-0297; the US Naval Ocean Systems Center (NOSC) under contract N66001-81-C-0484; and the
IBM Corporation under contract YD-289658. The views and conclusions contained in this document are
those of the author and should not be interpreted as representing the official policies, either
expressed or implied, of RADC, NOSC or IBM.




Acknowledgements 

This thesis would not have existed were it not for aid and comfort from many sources. I can hope to
mention only a few important names. To all of the others who have helped me, please accept my thanks.

I would like to thank my advisor, Prof. E. Douglas Jensen, for his guidance, encouragement and
support. His vision of a decentralized computer system has created the Archons project, which
provides us with both a new and exciting area of research and a stimulating and productive working
environment. I commend Prof. John P. Lehoczky for his great endurance in reading many poorly written
drafts of my early ideas and for his prompt response to each of my attempts to clarify those ideas.
His penetrating comments are a source of inspiration to me. Prof. Martin S. McKendry's insight into
the system aspects of concurrency control is always clear and fresh. Although I met him late and have
had relatively few discussions with him, every discussion led to new clarification of my ideas. It is
a great pleasure to thank Prof. John Shen for encouraging me to do independent research and for
providing me with much advice and support.

I would like to thank Dr. Hide Tokuda, Raymond Clark and Douglass Locke for many productive
discussions of ideas related to my theory and their object oriented approach to designing the
decentralized operating system ArchOS. Such discussions are particularly helpful in providing
guidance for future work. I would also like to thank Prof. Richard Rashid and Prof. Alfred Spector
for their helpful comments and to thank Cynthia Hibbard for teaching me the skills of writing.

My special thanks go to my wife, Shirley, who is always most understanding. Without her
encouragement and support, I probably would never have made it. I would also like to thank my parents
for their unending confidence in my ability to accomplish important goals in life, and for providing
a strong incentive to do so.




Abstract 

A distributed computer system offers the potential for a degree of concurrency, modularity and
reliability higher than that which can be achieved in a centralized system. To realize this
potential, we must develop provably consistent and correct scheduling rules to control the concurrent
execution of transactions. Furthermore, we must develop failure recovery rules that ensure the
consistency and correctness of concurrency control in the face of system failures. Finally, these
scheduling and recovery rules should support the modular development of system and application
software so that a transaction can be written, modified, scheduled and recovered from system failures
independently of others.

To realize these objectives, we have developed a formal theory of modular scheduling rules and
modular failure recovery rules. This theory is a generalization of the classical works of
serializability theory, nested transactions and failure atomicity. In addition, this theory addresses
the concepts of consistency, correctness, modularity and optimality in concurrency control and
failure recovery. This theory also provides us with provably consistent, correct and optimal modular
concurrency control and failure recovery rules.

Currently, this theory is being used in the development of the ArchOS decentralized operating system
at the Computer Science Department of Carnegie-Mellon University [Jensen 84, Tokuda 85].




1. Introduction


Distributed computer systems offer the potential to develop a decentralized operating system that
supports highly concurrent and reliable computations that are needed in many large scale distributed
real time control systems. One important objective of the ArchOS decentralized operating system is to
realize this potential via its atomic transaction facility at the kernel level [Jensen 84]. To this
end, we must develop provably consistent and correct concurrency control and failure recovery
mechanisms. Concurrency control mechanisms must ensure the consistency and correctness of concurrent
executions of a system. Failure recovery mechanisms must be able to maintain the consistency and
correctness of concurrency control in the face of system failures. These two types of mechanisms
should also support the modular development of system and application software so that a transaction
can be written, modified, scheduled and recovered from system failures independently of other
transactions. The objective of this thesis is to provide a theoretical foundation for the
development of such modular concurrency control and failure recovery mechanisms.

This chapter is organized as follows. In Section 1.1 we review related works and discuss the main
contribution of this thesis. In Section 1.2 we give an overview of the main concepts and results of
this thesis, and in Section 1.3 we describe the organization of this thesis.

1.1  Related  Works  and  The  Main  Contribution  of  This  Thesis 

In  this  section,  we  review  the  important  developments  in  concurrency  control  and  failure  recovery  theories 
and  point  out  the  main  contribution  of  this  thesis.  In  Section  1.1.1,  we  review  related  works  and  in  Section 
1.1.2  we  discuss  the  main  contribution  of  this  thesis. 

1.1.1  Related  Works 

The review of related works is divided into three parts. In Section 1.1.1.1, we state the
fundamental assumptions used by concurrency control and failure recovery theories and in Section
1.1.1.2, we discuss some of the important developments of serializability theory and failure
atomicity. Finally, in Section 1.1.1.3 we examine some recent developments in non-serializable
concurrency control and failure recovery theories.


1.1.1.1 Fundamental Assumptions

In concurrency control, two properties of transactions are of fundamental importance: consistency
and correctness. By "consistent", we mean that the results of executing a set of transactions satisfy
database consistency constraints, and by "correctness" we mean the satisfaction of the transaction
post-conditions. In failure recovery, an important concept is "clean and soft" failures
[Bernstein 83]. A system failure is said to be clean and soft if its effect can be modelled by a
network of computers some of which stop running and lose the content of their main memories.
However, the database, including the portion of the database residing in the stable storage used by
the failed computers, is not affected by the failures. The stable storage is an idealized model of a
network of reliable disks or other reliable storage devices whose probability of failure is so small
that such failures can be ignored by the applications in question. In the following discussion, we
assume that each transaction in the system is consistent and correct when executing alone.
Furthermore, we assume that failures are clean and soft.

1.1.1.2 Serializability Theory and Failure Atomicity

The classical works on concurrency control and failure recovery are known as serializability theory
and failure atomicity respectively [Bernstein 79]. In serializability theory, a schedule is said to
be serializable if the computations performed by any transaction under this schedule are equivalent
to those under some serial schedule. The consistency and correctness of serializable schedules follow
immediately from the definition of serializable schedules and from the consistency and correctness
assumptions about individual transactions. The most commonly used scheduling mechanism for enforcing
serializability in concurrency control is the two phase lock protocol [Eswaran 76], which states that
a transaction cannot acquire any new lock if it has released any of the acquired locks.
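
The two phase rule stated above can be checked mechanically on the lock operations of a single
transaction. The following is a minimal sketch for illustration only; the operation encoding and
the name `is_two_phase` are assumptions, not part of the cited protocol description.

```python
def is_two_phase(operations):
    """operations: sequence of ("lock" | "unlock", item) pairs issued by one transaction.

    Returns True if no lock is acquired after any lock has been released.
    """
    released_any = False
    for action, _item in operations:
        if action == "unlock":
            released_any = True  # the shrinking phase has begun
        elif action == "lock" and released_any:
            return False  # acquiring after releasing violates the rule
    return True
```

A transaction that locks `a` and `b` and then unlocks both satisfies the rule, while one that
unlocks `a` before locking `b` does not.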

The failure recovery rule used in the context of serializable schedules is called failure atomicity.
This rule states that under a serializable schedule a transaction either commits all its executed
steps at the end of the transaction or aborts these executed steps. Committing a sequence of executed
steps means non-divisibly transferring the results of computation performed by this sequence of steps
from the main memories of computers to the database residing in the stable storage. Committing a
sequence of steps represents irrevocable changes made to the database. In order to avoid cascaded
aborts, the computational results of a transaction must be withheld until it has successfully
committed. For example, if we use the two phase lock protocol to enforce the serializability of
concurrency control and want to avoid cascaded aborts, then the two phase lock protocol must be
modified as follows. Instead of releasing locks after a transaction has acquired all its locks, the
transaction now must not release any of its acquired write locks until it has committed [Wood 80].
This is because a transaction can be aborted by failures before it has committed. To release a write
lock before a transaction has committed would allow other transactions to use the results of a
partial computation that could be invalidated by a later abort operation.
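
The modified rule can be pictured with a small sketch in which a transaction refuses to give up a
write lock before it has committed. The class and method names are hypothetical, and the sketch
models only the release rule for one transaction, not a complete lock manager.

```python
class Transaction:
    """Tracks the write locks of one transaction under the modified protocol."""

    def __init__(self, name):
        self.name = name
        self.write_locks = set()
        self.committed = False

    def acquire_write_lock(self, item):
        self.write_locks.add(item)

    def commit(self):
        self.committed = True

    def release_write_lock(self, item):
        # Write locks are held until commit so that no other transaction can
        # read results that a later abort might invalidate.
        if not self.committed:
            raise RuntimeError(self.name + " must commit before releasing " + item)
        self.write_locks.discard(item)
```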

Having discussed the basic ideas of serializability theory and failure atomicity, we now review
three important developments in this area: the generalization of the syntax of transactions from a
single level transaction to nested transactions by Moss [Moss 81]; the investigation of using an
arbitrary set of primitive transaction steps in addition to the commonly used "read step" and "write
step" by Korth [Korth 83]; and the integration of serializability theory with the software
engineering approach, abstract data type, by Weihl [Weihl 84]. In the early work on serializability
theory, the syntax of transactions was restricted to be a single level with two primitive database
operations "read" and "write". Moss [Moss 81] generalized the syntax of transactions into the form of
nested transactions. The syntax of a nested transaction has the structure of a tree, where the root
is called the top level transaction and where nodes are called sub-transactions. The leaves of the
tree represent transaction steps. Lower level sub-transactions are invoked by higher level ones and
sub-transactions can be executed in parallel. Moss also developed a lock passing scheme to carry out
the two phase lock protocol in the context of a nested transaction. This scheme states that a
sub-transaction can acquire new locks, but it cannot release locks to other transactions. At the end
of execution, a sub-transaction must pass its locks either to another sub-transaction which waits for
the locks or to its parent. Eventually, all the locks will be held by the top level transaction,
which can release its locks only if the entire nested transaction has acquired all its locks. In
other words, a nested transaction, as a whole, follows the two phase lock protocol: releasing locks
only after acquiring all the locks which the nested transaction needs. Within the same nested
transaction, sub-transactions can pass locks to each other when they share data objects.

Nested transactions have three clear advantages over single level transactions. First, the structure
of nested transactions is similar to high level languages which support the nesting of sub-programs.
This makes transactions easier to write. Second, a higher degree of concurrency becomes possible
because sub-transactions can be executed in parallel. Finally, the management of failure recovery
also becomes more flexible. When a sub-transaction aborts, all its descendant sub-transactions must
also be aborted. This is because the descendant sub-transactions use the results passed by their
parent to perform computations on behalf of their parent and the abortion of their parent invalidates
their computations. However, the parent of an aborted sub-transaction has the option either to abort
or to continue the computation. For example, if a parent invokes a sub-transaction to obtain some
resource from node A in a network and this sub-transaction aborts, the parent has the option either
to abort the computation or to invoke another sub-transaction to get the resource from node B. At the
end of the nested transaction, all the executed and non-aborted steps represent the computation
performed by a nested transaction under a serializable schedule. These steps must be either committed
or aborted as an indivisible unit to satisfy the failure atomicity requirement. In order to avoid
cascaded aborts, a nested transaction, like single level transactions, cannot release any write lock
to other transactions until it has successfully committed. Although Moss investigated the use of
nested transactions, it was Lynch [Lynch 83a] and Beeri, Bernstein, Goodman and Lai [Beeri 83] who
extended the transaction model of serializability theory to include nested transactions.

We now turn to the subject of database primitive steps in the development of serializability theory.
The most commonly used set of primitive steps consists of two steps: a "read step" and a "write
step". Korth [Korth 83] investigated the possibility of using an arbitrary set of primitive steps. He
assumed that for each primitive database step, there is an associated lock mode. For example, we have
"read lock" and "write lock" for "read step" and "write step" respectively. He studied the
compatibility between lock modes. Two locks are said to be compatible if two steps can simultaneously
hold their locks on the same data object and yet the resulting schedule is still serializable. For
example, "read locks" are compatible, because under a serializable schedule when a transaction holds
a read lock on a data object, the steps of other transactions can still put their read locks on the
same data object. Korth [Korth 83] showed that the "permissiveness of a compatibility function" is
determined by the commutativity of the steps in question. This means that the increase in concurrency
due to the system permitting data objects to be shared by different steps is determined by the
commutativity of these steps. Unfortunately, except for the read step, it is difficult to find
general purpose primitive steps that commute. For example, unconditional algebraic add steps commute.
But they are not often useful, because the values of a data object are often bounded. On the other
hand, the conditional add steps are useful but do not commute. For example, suppose that integer data
object A is initially zero with consistency constraint "$0 \le A \le 100$". Step $t_1$ is "If A > 0,
then A := A - 1" and step $t_2$ is "If A < 100, then A := A + 1". The sequences of steps
{$t_1$, $t_2$} and {$t_2$, $t_1$} give different values to A and thus $t_1$ and $t_2$ do not commute.
In order to be commutative, we must have $t_1(t_2(A)) = t_2(t_1(A))$ for every state of A.
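
The conditional add example can be checked directly. The sketch below is illustrative only: the
function names `t1`, `t2` and `commute` are chosen here to mirror the two steps and the
commutativity condition, and the state range 0 to 100 is taken from the consistency constraint.

```python
def t1(a):
    """If A > 0, then A := A - 1."""
    return a - 1 if a > 0 else a


def t2(a):
    """If A < 100, then A := A + 1."""
    return a + 1 if a < 100 else a


def commute(f, g, states):
    """True if applying f and g in either order agrees on every given state."""
    return all(f(g(a)) == g(f(a)) for a in states)


# Starting from A = 0: running t1 then t2 leaves A = 1, running t2 then t1 leaves A = 0.
after_t1_then_t2 = t2(t1(0))
after_t2_then_t1 = t1(t2(0))
```

By contrast, `commute(lambda a: a + 1, lambda a: a - 1, range(-5, 6))` holds, which matches the
remark that unconditional add steps commute.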

Due to the scarcity of useful general purpose commutative steps, the "read step" and the "write step"
have become the most popular set of primitive steps for general purpose applications. On the other
hand, the identification of commutative steps can be used to good advantage in specific applications.
For example, if we have data objects in the form of sets, we can improve concurrency by utilizing the
fact that the union operation is commutative.

A recent development in serializability theory by Weihl [Weihl 84] is to integrate serializability theory and
failure atomicity with the abstract data type approach. When an abstract data type is embedded with
mechanisms which ensure serializability and failure atomicity in concurrent executions, Weihl calls it an
atomic data type.

The computational models of abstract data type and of serializability theory are different. An abstract data
type is a computational model developed in the context of time sharing operating systems. It consists of a set
of objects and a set of procedures that represent the operations on the set of objects. The objects of an abstract
data type can be either data objects or abstract data types. For example, the abstract data type
"List-of-Computers-in-The-Network" consists of a list of elements and a set of operations such as "insert", "delete"
and "search". An element in this list is the abstract data type
"Computer-as-Resource-for-Network-Level-Sharing", which consists of data records and operations for reading and
updating the record.

Serializability theory and failure atomicity have been developed in the context of database systems. The
model of computation in this case is a transaction system: a set of transactions sharing a common database.
Since the models of computation are different, mechanisms for concurrency control and failure recovery in
database systems might not be appropriate for atomic data types. To ensure failure atomicity, Weihl required
that every "activity" (transaction) must not issue commit commands until the execution had ended.
Transactions satisfying this requirement are said to be "well formed" by Weihl, and his theory dealt with only well
formed transactions. Once the requirement of failure atomicity is adopted, the open problem is how
best to ensure serializability by local concurrency control mechanisms embedded in atomic data types.

The two best known types of concurrency control mechanisms are the time stamp based methods and
variations of the two phase lock protocol. The typical time stamp method requires each transaction to get a
time stamp from a centralized system counter. The local scheduler of each node will serve the requests of
transactions according to the order of their time stamps. For example, suppose that at node A there are
requests from two transactions with time stamps 1 and 5 respectively. These two transactions will be
processed according to their time stamp order, 1 and 5. However, if a transaction with time stamp 2 arrives
at node A [the remainder of this sentence is illegible in the source copy]. The transactions with time stamps
2 and 5 will be processed according to their time stamp order, 2 and 5. If out of order requests are infrequent, a
reasonable performance can be achieved by the time stamp method. The behavior of time stamp based
methods is characterized by Weihl as static atomicity, where "static" refers to the pre-determined total
ordering of time stamps which is used for concurrency control.
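
As a rough illustration of serving requests in time stamp order, consider the Python sketch below. It is not taken from the cited work: the class NodeScheduler and its methods are hypothetical names, and the rule that a request older than the last served time stamp is rejected is an assumption made here (a common convention in basic time stamp ordering), since the corresponding sentence in the source is illegible.

```python
import heapq


class NodeScheduler:
    """Local scheduler of one node; serves pending requests in time stamp order."""

    def __init__(self):
        self._pending = []        # min-heap of time stamps waiting to be served
        self._last_served = None  # largest time stamp served so far
        self.served = []          # time stamps in the order they were served

    def submit(self, timestamp: int) -> bool:
        """Queue a request; reject it if it arrives out of order (assumed rule)."""
        if self._last_served is not None and timestamp < self._last_served:
            return False
        heapq.heappush(self._pending, timestamp)
        return True

    def serve_next(self):
        """Serve the pending request with the smallest time stamp, if any."""
        if not self._pending:
            return None
        timestamp = heapq.heappop(self._pending)
        self._last_served = timestamp
        self.served.append(timestamp)
        return timestamp


node_a = NodeScheduler()
node_a.submit(1)
node_a.submit(5)
first = node_a.serve_next()       # the request with time stamp 1 is served
accepted_2 = node_a.submit(2)     # time stamp 2 arrives before 5 has been served
node_a.serve_next()
node_a.serve_next()
accepted_3 = node_a.submit(3)     # arrives after 5 was served: out of order
```

Under these assumptions the served order is fixed entirely by the time stamps, which is the sense in which the ordering is "static".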

Alternatively, one can use two phase lock based protocols as embedded local concurrency control
mechanisms. For example, each operation of an atomic data type can be viewed as a sub-transaction of a
nested transaction. We can then follow Moss' lock passing scheme to ensure the serializability of concurrency
control. The behavior of two phase lock based protocols is characterized by Weihl as dynamic atomicity. The
word "dynamic" refers to the fact that the serial schedule which is equivalent to the realized serializable
schedule is not pre-determined. In addition, Weihl also investigated the behavior of the hybrid case of time
stamp and two phase lock protocols, whose behavior is characterized as hybrid atomicity. Weihl concluded
that all three types of atomicity are "optimal" in the sense that none of the three is strictly inferior to the
others in terms of concurrency. To enhance the concurrency of atomic data types, Weihl also investigated
methods that support the use of user specified commutative operations in the context of atomic data types.

In summary, the classical works on concurrency control and failure recovery are serializability theory and
failure atomicity. Serializable schedules create a virtual "executing alone" environment which is sufficient for
transactions to execute consistently and correctly. At the end of the execution of a transaction, the failure
atomicity rule either non-divisibly transfers the results of the computation to the database or aborts the
transaction and re-starts it later. In this way, serializability theory and failure atomicity ensure the consistency
and correctness of concurrency control in the face of clean and soft system failures. However, the
concurrency offered by serializability theory and failure atomicity is quite limited, because a transaction must
withhold all the results of its computation until it has completed its computation and successfully committed.
Otherwise, one must be prepared to deal with the highly undesirable event of cascaded aborts. The
commonly used set of primitive steps consists of the "read step" and "write step". Due to the scarcity of useful
general purpose commutative steps, except in some specific applications, it is difficult to find a richer set of
primitive steps which would significantly enhance the concurrency of serializable schedules. Nonetheless, the
development of serializability theory represents a milestone in the history of distributed computer systems,
because this theory provides a theoretical foundation upon which provably consistent and correct concurrency
control and failure recovery mechanisms can be designed.


1.1.1.3 Non-Serializable Concurrency Control and Failure Recovery Theories

The limited concurrency offered by serializability theory has prompted researchers to investigate
non-serializable concurrency control. The fundamental difference between serializable and non-serializable
concurrency control is that the latter accepts schedules which are not equivalent to any serial schedule but are
both consistent and correct. Kung and Papadimitriou [Kung 79] showed that serializability theory is optimal
when only transaction syntax information is used for scheduling. This means that we must use information in
addition to the syntax information of transactions in order to obtain a higher degree of concurrency than that
which is offered by serializability theory. Some researchers have focused on the use of transaction semantic
information in non-serializable concurrency control. For example, Molina [Molina 83] proposed the concept
of semantically consistent schedules. In his approach, transactions are classified into two classes. One of the
classes needs a consistent view of the database while the other class does not. A schedule is said to be
semantically consistent if under this schedule all the transactions that require a consistent view will have a
consistent view. He also developed algorithms that produce semantically consistent schedules. However, he
has not formalized the concept of "does not need a consistent view", nor has he described how this concept is
related to the pre- and post-conditions of transactions. It is the users' responsibility to prove that schedules
resulting from their semantic information specifications are both consistent and correct for their specific
applications.

Another theoretical development was Lynch's [Lynch 83b] concept of multi-level atomicity for the
utilization of transaction semantic information in concurrency control. In this approach, transactions are first
grouped into different classes according to the properties of their computations. The permissible ways in
which the steps of a transaction can be interleaved with those of others are specified by the break points in
between transaction steps. The specification of break points is relative to the classes of transactions. For
example, when there are two classes of transactions C_1 and C_2, there will be three sets of break point
specifications: one set for transactions within class C_1, one set for transactions within class C_2, and one set for
transactions belonging to different classes C_1 and C_2. Lynch also developed a hierarchical form for specifying
break points. A schedule that satisfies such a specification is said to be multi-level atomic. Lynch developed
algorithms that produce multi-level atomic schedules. It is, however, the responsibility of users to prove that
their break point specifications produce multi-level atomic schedules that are both consistent and correct for
their specific applications.


There have also been some experimental works that are interesting examples in non-serializable
concurrency control. Allchin [Allchin 82] developed the concept of end of transaction serializability.
Schwarz [Schwarz 82] used a concept called transaction dependency relations to utilize transaction semantic
information in the construction of shared atomic data types. The focus of their work is similar to that of
Weihl - the integration of the abstract data type approach with distributed concurrency control and failure
recovery mechanisms - however, they do not restrict themselves to serializability theory. Allchin and
Schwarz gave several interesting examples of non-serializable concurrency control; however, there has not
been a formal proof of the consistency and correctness of non-serializable concurrency control based on the
concepts of either "end of transaction serializability" or "dependency relations".

Currently, in the Computer Science Department of Carnegie-Mellon University, Tokuda, Clark and
Locke [Tokuda 85] are developing a decentralized operating system, ArchOS, which is the operating system of
the Archons decentralized computer system project [Jensen 83, Jensen 84]. An important feature of ArchOS
is that it supports transactions at the kernel level. In fact, all the ArchOS system primitives employ
transactions. The focus of their research effort in the area of transaction facility is on the development of a new
model of computation called Arobjects. This model integrates our theory with a distributed abstract data type
approach. Since their work involves both the application and extension of the results developed in this thesis,
we will comment on their work in Section 8.2, Future Work. Readers who are interested in the use of our
theory in the development of a distributed operating system should read their work [Tokuda 85], which
contains detailed examples and considerations of system development that will not be addressed in this thesis.

1.1.2  The  Main  Contribution  of  This  Thesis 

Recently, non-serializable concurrency control has become the frontier in the investigation of concurrency
control. Existing experimental and theoretical investigations have provided evidence that the semantic
information of transactions can be utilized to good advantage in some specific applications. However, at this time,
no other non-serializable concurrency control study has provided general non-serializable concurrency
control rules that have been proven to be consistent, correct and modular, nor has there been a general theory
that coherently addresses the relationship among the concepts of consistency, correctness, modularity and
optimality in non-serializable concurrency control and failure recovery. The main contribution of this thesis
is to provide such a theory and to deliver provably consistent, correct and optimal modular concurrency
control and failure recovery rules.


Our theory of modular concurrency control and failure recovery is a generalization of nested transactions,
serializability theory and failure atomicity. The essential difference between our work and that of Molina and
of Lynch is our emphasis on the modularity of concurrency control and failure recovery management.

Based upon our experience with centralized operating systems, we can expect a distributed operating
system to be very complex and to be written and modified by many different programmers over a period of
years. Thus, it is important to develop scheduling and recovery rules that support the modular development
of system and application software. Intuitively, a scheduling rule is modular if it allows a programmer to
write, modify and schedule his transaction independently of the rest of the transactions in the system. A
recovery rule is modular if it allows a programmer to specify the conditions for committing and aborting the
transaction independently of the rest of the transactions in the system. For example, serializability theory is a
modular scheduling rule, and failure atomicity is a modular recovery rule. This is because both of them can
be enforced by a programmer who knows only the details of his transaction. For example, a programmer can
use the two phase lock protocol to schedule his transaction and write down the commit or abort conditions
without any knowledge of the rest of the transactions in the system.

In this thesis, all our proposed scheduling and recovery rules are proved to be consistent, correct and
modular. That is, users of our rules need not know the semantics of others' transactions. As long as everyone
can ensure that his own transaction is consistent and correct when executing alone and that he uses our rules
accordingly, the consistency and correctness of system concurrency control will be ensured in the face of clean
and soft failures. This is, however, not the case with either Molina's or Lynch's approaches. Their methods
examine the details of a given set of transactions and then determine how the steps of one transaction can be
interleaved with those of others. As a result, the modification of an existing transaction or the addition of a
new transaction could invalidate the scheduling of existing transactions. However, when problems are
specific and static, their methods can be utilized to produce a degree of concurrency higher than that which is
provided by our approach. In the next section, we give an informal overview of our theory.

1.2 Overview of This Thesis

We have developed a formal theory of modular scheduling rules and modular failure recovery rules, which
provides us with optimal modular scheduling and recovery rules. Furthermore, this new theory is a
generalization of serializability theory, nested transactions and failure atomicity. In Section 1.2.1, we describe
the main characteristics of our approach from a programmer's point of view. In Section 1.2.2, we review the
main concepts of this theory and in Section 1.2.3 we discuss the main results.


1.2.1  A  Programmer’s  View 

From a programmer's point of view, under serializability theory and failure atomicity each transaction is an
atomic unit of operations and the database is a non-divided unit of data objects. In our approach, both the
database and transactions are divided as follows.

1. The database is partitioned into consistency preserving atomic data sets. As long as the
consistency constraints of each atomic data set are satisfied individually, the consistency of the database
is maintained.

2. Each transaction is independently partitioned into a partially ordered set of correctness preserving
elementary transactions. As long as the post-condition of each executed elementary transaction is
individually satisfied, the post-conditions of the partitioned transaction (called a compound
transaction) are also satisfied.

3. The steps of each elementary transaction are partitioned into transaction ADS segments, each of
which is the set of all the steps in an elementary transaction accessing the same atomic data set. If
the steps of a transaction ADS segment are not interleaved with those of any other transaction ADS
segment, then this segment is an atomic commit segment. Interleaved transaction ADS segments
are taken as a single atomic commit segment.
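
The third partitioning step can be sketched in Python as follows. This is one possible reading, not the formal definition given later in the thesis: the function names and the representation of a step as a (step name, atomic data set name) pair are assumptions made here, and two segments are treated as interleaved when the position ranges of their steps overlap in the execution order.

```python
def ads_segments(steps):
    """Group the steps of one elementary transaction by atomic data set.

    steps is a list of (step_name, ads_name) pairs in execution order.
    Returns a dict mapping each ads_name to its list of (position, step_name).
    """
    segments = {}
    for position, (step, ads) in enumerate(steps):
        segments.setdefault(ads, []).append((position, step))
    return segments


def atomic_commit_segments(steps):
    """Merge interleaved transaction ADS segments into atomic commit segments.

    Returns a list of groups; each group lists the atomic data sets whose
    segments must be committed together, in order of first access.
    """
    segments = ads_segments(steps)
    # One (first position, last position, ads name) span per segment.
    spans = sorted((items[0][0], items[-1][0], ads) for ads, items in segments.items())
    groups = []
    for first, last, ads in spans:
        if groups and first < groups[-1]["last"]:
            # This segment starts before the previous group ends: interleaved.
            groups[-1]["ads"].append(ads)
            groups[-1]["last"] = max(groups[-1]["last"], last)
        else:
            groups.append({"ads": [ads], "last": last})
    return [group["ads"] for group in groups]


separate = [("read x", "X"), ("write x", "X"), ("read y", "Y"), ("write y", "Y")]
mixed = [("read x", "X"), ("read y", "Y"), ("write x", "X"), ("write y", "Y")]
```

With the hypothetical step lists above, `separate` yields two atomic commit segments because all steps on X precede all steps on Y, while `mixed` yields a single one because the two segments are interleaved.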


Having divided the database and transactions, we then employ some concurrency control protocol to
ensure that elementary transactions are executed serializably on an (atomic data) set by set basis. In addition,
atomic commit segments of a transaction are committed in the order they are executed. The computational
result of an atomic commit segment is withheld until it has successfully committed. One way to implement
this is to associate the two phase lock protocol with each atomic commit segment. In addition, the write locks
are not released until the atomic commit segment has been successfully committed. As a result of this approach,
the following is true despite clean and soft system failures:

1.  The  database  consistency  constraints  will  be  satisfied. 

2.  The  post-condition  of  each  transaction  will  be  satisfied. 

3.  There  will  be  no  cascaded  aborts. 

4.  This  approach  provides  at  least  as  much  concurrency  as  any  other  consistent  and  correct  modular 
concurrency  control  and  failure  recovery  method. 
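
The locking discipline described before this list (results withheld and the write lock kept until the atomic commit segment has committed) might be sketched in Python as below. The class AtomicDataSet and the method run_segment are hypothetical names introduced here, an ordinary threading.Lock stands in for the write lock, and the sketch covers a single atomic data set on a single machine only; it does not model failures or stable storage.

```python
import threading


class AtomicDataSet:
    """An integer-valued atomic data set with constraint low <= value <= high."""

    def __init__(self, value: int, low: int, high: int):
        self._lock = threading.Lock()  # stands in for the write lock
        self.value = value
        self.low = low
        self.high = high

    def run_segment(self, update) -> bool:
        """Run one atomic commit segment against this data set.

        update maps the current value to a tentative new value. The tentative
        result stays invisible to other transactions until it is committed,
        and the lock is released only after the commit or abort decision.
        """
        with self._lock:
            tentative = update(self.value)
            if not (self.low <= tentative <= self.high):
                return False          # abort: the committed value is unchanged
            self.value = tentative    # commit
            return True


heap_a = AtomicDataSet(value=1, low=0, high=1)
took_first = heap_a.run_segment(lambda v: v - 1)   # takes the only unit
took_second = heap_a.run_segment(lambda v: v - 1)  # no unit left, so it aborts
```

Because an aborted segment never changes the committed value, no other transaction can have observed its result, which is why this discipline avoids cascaded aborts in the sketch.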


The concept of modularity in the context of concurrency control and failure recovery needs some
explanation. Intuitively, a modular concurrency control method is one which allows a programmer to write, modify
and schedule his transaction independently of other transactions. A failure recovery method is modular if it
allows a programmer to write down the commit/abort conditions independently of other transactions. For
example, serializability theory is a modular scheduling method, because a programmer can write, modify and
then schedule (e.g. using two phase lock) his transaction independently of other transactions in the system.
Failure atomicity is a modular recovery method, because one can enforce it on a transaction by writing down
the commit/abort conditions independently of other transactions.

To illustrate our approach, let us consider the following example. Suppose that transaction Get-A-and-B
needs one unit of some resource at node A and another unit at node B. If it cannot have both, then it would
rather have none, since the task cannot be run with only one of the resources. The resource heap at each node
is modelled as an atomic data set with consistency constraint "0 <= HeapSize <= MaxSize". In order not to
block the resource allocation activity at any node, we structure the transaction as a compound transaction
Get-A-and-B, which is illustrated in Table 1-1.

This example illustrates the following four characteristics of our approach. First, the database is partitioned
into two consistency preserving atomic data sets: Heap A and Heap B. Second, the compound transaction
Get-A-and-B consists of four correctness preserving elementary transactions, which co-operatively carry out
the task of the compound transaction by passing information to each other via atomic variables Obtain-A and
Obtain-B. As long as the post-conditions of the executed elementary transactions are individually satisfied,
the post-conditions of the compound transaction Get-A-and-B are satisfied. Although in this example each of
the elementary transactions is a single level transaction, an elementary transaction can have the form of a
nested transaction. Third, each elementary transaction has one transaction ADS segment which is also an
atomic commit segment. These atomic commit segments are committed in the order of their execution. Note
that this commit protocol violates the requirement of failure atomicity: all executed steps of a transaction
must be committed as a non-divisible unit. Fourth, the locking protocol used by this transaction only ensures
that each elementary transaction is executed serializably with respect to each atomic data set. Global
serializability of concurrency control is not enforced. For example, suppose that we have another compound
transaction Get-A-or-B, which is illustrated in Table 1-2. Transaction Get-A-or-B tries to get one unit of
resource from node A and one unit of resource from node B. If it cannot get both, then it keeps whatever it
has obtained.


CompoundTransaction Get-A-and-B;

AtomicVariable Obtain-A, Obtain-B : Boolean;

BeginSerial

    BeginParallel

        ElementaryTransaction Get-A;
        BeginSerial
            WriteLock Heap A;
            Take a unit from A if available
                and indicate the result in atomic variable Obtain-A;
            Commit and unlock Heap A;
        EndSerial;

        ElementaryTransaction Get-B;
        BeginSerial
            WriteLock Heap B;
            Take a unit from B if available
                and indicate the result in atomic variable Obtain-B;
            Commit and unlock Heap B;
        EndSerial;

    EndParallel;

    BeginParallel

        ElementaryTransaction Put-Back-A;
        BeginSerial
            If Obtain-A and not Obtain-B
            then BeginSerial
                WriteLock Heap A;
                Put-Back the unit of A;
                Commit and unlock Heap A;
            EndSerial;
        EndSerial;

        ElementaryTransaction Put-Back-B;
        BeginSerial
            If Obtain-B and not Obtain-A
            then BeginSerial
                WriteLock Heap B;
                Put-Back the unit of B;
                Commit and unlock Heap B;
            EndSerial;
        EndSerial;

    EndParallel;

EndSerial.


Table 1-1: Compound Transaction Get-A-and-B


CompoundTransaction Get-A-or-B;

BeginParallel

    ElementaryTransaction Get-A;
    BeginSerial
        WriteLock Heap A;
        Take a unit from A if available;
        Commit and unlock Heap A;
    EndSerial;

    ElementaryTransaction Get-B;
    BeginSerial
        WriteLock Heap B;
        Take a unit from B if available;
        Commit and unlock Heap B;
    EndSerial;

EndParallel.


Table 1-2: Compound Transaction Get-A-or-B

Suppose that we execute transactions Get-A-and-B and Get-A-or-B with Heap A and Heap B with initial
values of one. It is possible that transaction Get-A-and-B gets the only unit of A first while transaction
Get-A-or-B gets the only unit of B first. As a result, transaction Get-A-and-B will put back the obtained unit
to A while transaction Get-A-or-B will keep the unit obtained from B. The result of this execution is not
equivalent to executing these two transactions serially. When these two transactions are executed serially, one
of them will get both units.
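
The execution just described can be replayed with a small Python simulation. It is a sketch under simplifying assumptions: each elementary transaction is modelled as a single indivisible function call (standing in for "write lock, act, commit and unlock"), failures are not modelled, and the names take, finish_get_a_and_b, run_serial and run_interleaved are hypothetical.

```python
def take(heaps, name) -> bool:
    """Elementary transaction Get-<name>: take one unit if available."""
    if heaps[name] > 0:
        heaps[name] -= 1
        return True
    return False


def finish_get_a_and_b(heaps, obtain_a: bool, obtain_b: bool) -> int:
    """Put-Back-A and Put-Back-B of Get-A-and-B; returns the units it keeps."""
    if obtain_a and not obtain_b:
        heaps["A"] += 1
        return 0
    if obtain_b and not obtain_a:
        heaps["B"] += 1
        return 0
    return 2 if obtain_a else 0


def run_serial(order):
    """Run the two compound transactions one after the other."""
    heaps = {"A": 1, "B": 1}
    held = {}
    for txn in order:
        got_a, got_b = take(heaps, "A"), take(heaps, "B")
        if txn == "and":
            held["and"] = finish_get_a_and_b(heaps, got_a, got_b)
        else:
            held["or"] = int(got_a) + int(got_b)
    return held, heaps


def run_interleaved():
    """Get-A-and-B reaches Heap A first; Get-A-or-B reaches Heap B first."""
    heaps = {"A": 1, "B": 1}
    and_a = take(heaps, "A")
    or_b = take(heaps, "B")
    and_b = take(heaps, "B")
    or_a = take(heaps, "A")
    held = {"and": finish_get_a_and_b(heaps, and_a, and_b),
            "or": int(or_a) + int(or_b)}
    return held, heaps


interleaved_held, interleaved_heaps = run_interleaved()
serial_outcomes = [run_serial(["and", "or"])[0], run_serial(["or", "and"])[0]]
```

In this simulation the interleaved run leaves Get-A-and-B with nothing and Get-A-or-B with one unit, an outcome that neither serial order produces, yet both heap sizes stay within their bounds.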


Despite the violation of both serializability theory and failure atomicity, the consistency and correctness of
concurrency control will be preserved in the face of clean and soft failures. That is, the sizes of Heap A and
Heap B will be non-negative and within their bounds, transaction Get-A-and-B will either get both units or
nothing, and transaction Get-A-or-B will get both units, one of the two units or nothing. For example, if
compound transaction Get-A-and-B gets one of the two units, commits and then fails later, it can resume its
execution consistently and correctly later by either returning the obtained unit or getting another unit.
Furthermore, since elementary transaction Get-A (Get-B) leaves the database in a consistent state, transaction
Get-A-or-B will execute consistently and correctly despite the failure of Get-A-and-B.

The consistency and correctness of compound transactions like Get-A-and-B and Get-A-or-B is not an
isolated example but rather the result of satisfying the generalized setwise serializable scheduling rule and the
failure safe rule developed in this thesis. Serializability theory and failure atomicity are shown to be special
cases of them.


From an application point of view, the partition of the database and of transactions increases the system
concurrency and reduces the probability of deadlocks. For example, heaps A and B are assumed to be heavily
shared data objects. It is generally a good practice to minimize the duration of locking on such data objects.
Compound transaction Get-A-and-B is written with this objective in mind. We accomplish this objective by
dividing possible operations on atomic data set Heap A (Heap B) into two elementary transactions, Get-A
(Get-B) and Put-Back-A (Put-Back-B), so that Heap A (Heap B) is locked only when it is directly
manipulated. If Get-A-and-B is coded as a nested transaction and if a unit of A (B) has been obtained, the lock on
Heap A (Heap B) must be kept. Keeping the lock on Heap A (Heap B) in one node of the network, the
nested transaction must obtain the lock on Heap B (Heap A) in another node and determine whether it can
get units of both A and B and commit, or cannot get both and abort. These two locks can be released only
after the nested transaction has either committed or aborted.

From a system performance point of view, it can be quite time consuming to wait for the lock on a remote
data object in a network. When the nested transaction is waiting to operate on Heap B (Heap A), it blocks the
resource allocation activity on Heap A (Heap B). If a nested transaction is used to perform operations on
many heavily shared data objects distributed in a system, the blockage caused by the nested transaction could
be both large in extent and prolonged in duration. Recall that in the Get-A-and-B example, the structure of
compound transactions allows us to lock the heavily shared resource heap only when it is directly operated
upon. The partition of the database into small units of atomic data sets and the partition of a transaction into
several small elementary transactions not only improve the system concurrency but also reduce the
probability of the transaction being involved in a deadlock. Therefore, the development of compound transactions may
have important implications in real time control systems, in which an event can trigger tasks that require
resources from many distributed system elements such as signal processors, communication bandwidth, and
device controllers.

1.2.2  Basic  Concepts 

The four basic concepts concerning concurrency control and failure recovery addressed in this work are
consistency, correctness, modularity and optimality.


1.2.2.1 Consistency and Correctness

In concurrency control, two properties of transactions are of fundamental importance: consistency and
correctness. By "consistency", we mean the satisfaction of the database consistency constraints, and by
"correctness" we mean the satisfaction of the transaction post-conditions. In failure recovery, an important
concept is "clean and soft" failures [Bernstein 83]. A system failure is said to be clean and soft if its effect can
be modelled by a network of computers, some of which stop running and lose the content of their main
memories. However, the database, including the portion of the database residing in the stable storage used by
the failed computers, is not affected by the failures. We assume that each transaction is consistent and correct
when executing alone and that all failures are clean and soft. Under these assumptions, a schedule is said to
be consistent and correct if it executes each transaction consistently and correctly. A failure recovery rule is
said to be consistent and correct if the committed results of executing a set of transactions in the face of
system failures are equivalent to the results of an execution produced by a consistent and correct schedule of
the same set of transactions when there is no failure.

We have mentioned that serializable schedules are consistent and correct. However, serializable schedules
constitute only a subset of all the possible consistent and correct schedules. In this work, we study consistent
and correct scheduling rules whose associated schedules are supersets of the serializable schedules. That is,
these rules permit a higher degree of concurrency by including non-serializable schedules. Under
non-serializable schedules, transactions cannot be regarded as executing alone. The consistency and correctness of
such scheduling rules is not obvious and must be proved.

In this work, we also study consistent and correct failure recovery rules which provide sufficient conditions
upon which the consistency and correctness of the concurrency control is preserved in the face of clean and
soft system failures. The recovery rule associated with serializability theory is known as failure atomicity,
which specifies that a transaction either commits all its executed steps at the end of its execution or aborts
these executed steps. We have developed a consistent and correct recovery rule that specifies the conditions
under which the computations performed by a transaction can be committed part by part. In other words,
our recovery rule performs distributed "check-points" so that transactions not affected by failures execute
consistently and correctly and transactions interrupted by failures can resume their computations consistently
and correctly.




1.2.2.2 Modularity and Optimality

Based upon our experience with centralized operating systems, we can expect a distributed operating
system to be very complex, and to be written and modified by many different programmers over a period of
years. Thus, we must develop scheduling rules supporting the modular development of system and application
software. Intuitively, a scheduling rule is modular if it allows a programmer to write, modify and
schedule his transaction independently of any knowledge of the rest of the transactions in the system. A
recovery rule is modular if it allows a programmer to specify the conditions for committing and aborting the
transaction independently of the rest of the transactions in the system. For example, serializability theory is a
modular scheduling rule and failure atomicity is a modular recovery rule. This is because both of them can be
enforced by a programmer who knows only his transaction. For example, he can use the two phase lock
protocol to schedule his transaction and write down the commit or abort conditions without any knowledge of
the rest of the transactions in the system.

The optimality of scheduling rules is modelled by set containment. A scheduling rule is optimal under
some given conditions if the set of schedules permitted by this rule is the superset of the permissible schedules
of any other rule under the same conditions. The optimality of recovery rules is modelled in a similar fashion.

Both modularity and optimality are characterized by the way they utilize the information regarding the
transaction system and the database. We assume that transactions are scheduled one at a time. The
information available to scheduling and recovery rules can be classified as follows:

1.  The  consistency  constraints  of  the  database. 

2. The syntactic definition of transactions, such as the definition of nested transactions. Note that
the syntactic definition includes the definition of the set of primitive steps used by the
transactions, e.g., "read step" and "write step".

3. The local semantic information: the description of the computation performed by the transaction
which is being scheduled.

4. The global semantic information: the description of the computation performed by a set of
transactions. When a scheduling rule utilizes the semantic information of any transaction other
than the one being scheduled, we say that this rule uses global semantic information.

When a programmer writes or modifies a transaction, he must know the consistency constraints to verify the
consistency of his transaction. In addition, he must know the syntax of transactions and understand the
semantics of his transaction to write down the code correctly. In other words, when he is ready to schedule his
transaction and to write down the conditions for aborting or committing his transaction, he must use the first
three types of information. Thus, the designer of modular scheduling and recovery rules should exploit the
use of such information.


On the other hand, the designer of modular scheduling and recovery rules should avoid the use of global
semantic information. This is because when one uses the global semantic information, the modification of
one transaction could force the re-scheduling of other transactions. To illustrate the problems associated with
using global semantic information, let us consider the following example. Suppose that we have a database
consisting of two variables A and B with consistency constraint "A + B = 100". There are two transfer
transactions: $T_1 = \{t_{1,1}: A := A - 1;\ t_{1,2}: B := B + 1\}$ and
$T_2 = \{t_{2,1}: B := B - 2;\ t_{2,2}: A := A + 2\}$,
where the symbol "$t_{i,j}$" denotes step j of transaction i. In addition, there is another implementation of
transaction $T_2$: $T^*_2 = \{t^*_{2,1}: B := B - 2;\ t^*_{2,2}: A := 100 - B\}$. Note that transaction $T^*_2$,
like $T_2$, transfers two units from B to A and preserves the consistency constraint "A + B = 100" when
executing alone. Consider the following two non-serializable schedules:
$S_1 = \{t_{1,1};\ t_{2,1};\ t_{2,2};\ t_{1,2}\}$ and $S_2 = \{t_{1,1};\ t^*_{2,1};\ t^*_{2,2};\ t_{1,2}\}$.
Schedule $S_2$ is the same as $S_1$, except that the steps of $T_2$ are replaced by the steps of $T^*_2$. It is easy
to check that $S_1$ is consistent and correct, while $S_2$ is inconsistent. In Table 1-3, there are two examples of
executing schedules $S_1$ and $S_2$.

An Execution of Schedule $S_1$

Step                                A       B
(initial values)                    50      50
$t_{1,1}$: A := A - 1;              49
$t_{2,1}$: B := B - 2;                      48
$t_{2,2}$: A := A + 2.              51
$t_{1,2}$: B := B + 1.                      49

An Execution of Schedule $S_2$

Step                                A       B
(initial values)                    50      50
$t_{1,1}$: A := A - 1;              49
$t^*_{2,1}$: B := B - 2;                    48
$t^*_{2,2}$: A := 100 - B.          52
$t_{1,2}$: B := B + 1.                      49

Table 1-3: Executions of Schedules $S_1$ and $S_2$




Note that at the end of executing $S_2$, the sum of A and B is 101, an inconsistent result. Using global
semantic information about $T_1$, $T_2$ and $T^*_2$, a scheduling rule is able to accept $S_1$ and reject $S_2$. However,
this rule must know the interactions among the detailed codings of these transactions. The knowledge of the
database consistency constraints and the post-conditions of these transactions is insufficient. This is
demonstrated by the fact that $S_2$ is inconsistent, although the post-conditions of $T_2$ and $T^*_2$ are identical.
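
The two executions in Table 1-3 can be replayed mechanically. The following is a minimal sketch in Python, assuming the initial values A = 50 and B = 50 used in the table; the function names (`run`, `consistent`, `t11`, and so on) are illustrative and do not come from the report.

```python
# Consistency constraint from the example: A + B = 100.
def consistent(db):
    return db["A"] + db["B"] == 100

# Each step updates the database dictionary in place.
def t11(db): db["A"] = db["A"] - 1          # T1, step 1
def t12(db): db["B"] = db["B"] + 1          # T1, step 2
def t21(db): db["B"] = db["B"] - 2          # T2, step 1 (identical in T2*)
def t22(db): db["A"] = db["A"] + 2          # T2, step 2
def t22_star(db): db["A"] = 100 - db["B"]   # T2*, step 2

def run(schedule, a=50, b=50):
    """Execute the steps of a schedule in order and return the final state."""
    db = {"A": a, "B": b}
    for step in schedule:
        step(db)
    return db

S1 = [t11, t21, t22, t12]        # T2 interleaved inside T1
S2 = [t11, t21, t22_star, t12]   # same interleaving, T2 replaced by T2*

final_s1 = run(S1)
final_s2 = run(S2)
print(final_s1, consistent(final_s1))   # A = 51, B = 49: constraint holds
print(final_s2, consistent(final_s2))   # A = 52, B = 49: constraint violated

# Executed alone, T2* does preserve the constraint.
print(consistent(run([t21, t22_star])))
```

The sketch only reproduces the arithmetic of the example: the second implementation is harmless when run serially, but it reads B while the first transaction is half finished, which is why the interleaved schedule ends with A + B = 101.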

When a scheduling rule relies upon the details of several transactions to schedule each of the transactions, a
slight modification in one transaction could invalidate the scheduling of some other transactions. This could
cause serious difficulties in the development of a general atomic transaction facility, since new transactions
are constantly written and existing ones are frequently updated. In summary, a modular rule only uses the
first three types of information, and the optimal modular rule makes the best use of them. Finally, we want to
point out that when problems are specific and static, global semantic information could be utilized to good
advantage [Allchin 82, Schwarz 82, Molina 83, Lynch 83b]. Such an approach is, however, outside the scope
of this work.

1.2.3 Main Contributions

There are four main contributions in this thesis. The first is the formal model of modular scheduling rules.
The second is the setwise serializable scheduling rule, a scheduling rule which produces
non-serializable schedules. This rule is proven to be consistent, correct and modular, and it is the optimal
application independent scheduling rule. The third is a new transaction structure: compound transactions and
their associated generalized setwise serializable scheduling rules. These rules form a complete class within the
set of all the possible consistent and correct modular scheduling rules. This means that for any consistent and
correct modular scheduling rule, there always exists a generalized setwise serializable scheduling rule that
provides at least as much concurrency in scheduling. The fourth main contribution is the failure recovery rule
called the failure safe rule. This rule is the optimal failure recovery rule for compound transactions scheduled
by generalized setwise serializable scheduling rules.

1.2.3.1 A Model of Modular Scheduling Rules

Our  first  main  contribution  is  the  development  of  a  formal  model  of  modular  scheduling  rules.  This  model 
provides  us  with  a  formal  definition  of  the  intuitive  concept  of  "modularity"  in  the  context  of  concurrency 
control,  and  permits  us  to  have  a  unified  treatment  of  different  modular  scheduling  methods. 




Intuitively, a scheduling rule specifies how the steps of a transaction can be interleaved with those of other
transactions. In our model, a scheduling rule specifies the permissible ways of interleaving transaction steps
by partitioning the steps of each transaction into equivalence classes called atomic step segments. Schedules
satisfy the specification of a given scheduling rule by interleaving the atomic step segments of each
transaction serializably with the atomic step segments of other transactions. For example, serializability theory is a
particular scheduling rule which takes all the steps of a transaction as a single atomic step segment.
Serializable schedules are the set of schedules that satisfy this rule, because the atomic step segments specified by
serializability theory are interleaved serializably in serializable schedules. A scheduling rule is said to be
consistent and correct if and only if all the schedules satisfying this rule are consistent and correct. For
example, serializability theory is consistent and correct, because all the serializable schedules are consistent
and correct.

We now introduce the concept of modularity in concurrency control. A scheduling rule is said to be
modular if this rule partitions each transaction independently of other transactions in the system. For
example, serializability theory "partitions" the steps of each transaction independently of the rest of
the transactions in the system. Thus, serializability theory is a modular scheduling rule. When a transaction is being
partitioned by a modular rule, the only information regarding other transactions in the system is that they are
supposed to be consistent, correct and written in the syntax agreed upon. Nonetheless, given any arbitrary set
of such independently partitioned transactions, the modular scheduling rule that produces such partitions
must guarantee the consistency and correctness of all the schedules in which atomic step segments of one
transaction are interleaved serializably with those of other transactions.

The degree of concurrency produced by a scheduling rule is determined by the degree of interleaving the
rule permits; thus, finer partitions lead to a larger set of permissible schedules and a higher degree of
concurrency. The highest degree of concurrency is reached when the partition is so fine that each step is an atomic
step segment, while the lowest degree of concurrency is reached when the entire set of steps in a transaction is
taken as a single atomic step segment. However, a rule can only partition a transaction to the extent that it
can still guarantee the consistency and correctness of the resulting schedules with respect to the available
information. For example, with only transaction syntax information, there is little a scheduling rule can do
except take the entire transaction as a single atomic step segment. Serializability theory partitions each
transaction in this way. Kung and Papadimitriou [Kung 79] proved that serializability theory is optimal when
only transaction syntax information is used for scheduling. We have investigated how to utilize other types of
information that are also available to modular scheduling rules.
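
The effect of partition granularity on the number of permissible schedules can be illustrated with a small enumeration. The sketch below is an illustration under a simplifying assumption, not the report's formal definition: it accepts only schedules in which every atomic step segment occupies adjacent slots, which is a stricter condition than being interleaved serializably. The step labels and function names are hypothetical.

```python
from itertools import combinations

def interleavings(t1, t2):
    """All merges of t1 and t2 that keep each transaction's own step order."""
    n, m = len(t1), len(t2)
    for positions in combinations(range(n + m), n):
        chosen = set(positions)
        it1, it2 = iter(t1), iter(t2)
        yield tuple(next(it1) if i in chosen else next(it2) for i in range(n + m))

def segments_contiguous(schedule, partition):
    """True if the steps of every segment occupy adjacent slots in the schedule."""
    slot = {step: i for i, step in enumerate(schedule)}
    for segment in partition:
        slots = sorted(slot[s] for s in segment)
        if slots[-1] - slots[0] != len(slots) - 1:
            return False
    return True

def permitted(t1, t2, partition):
    """Schedules of t1 and t2 that never split an atomic step segment."""
    return {s for s in interleavings(t1, t2) if segments_contiguous(s, partition)}

T1 = ("t11", "t12")
T2 = ("t21", "t22")

coarse = permitted(T1, T2, [T1, T2])                                # one segment per transaction
medium = permitted(T1, T2, [T1, ("t21",), ("t22",)])                # only T2 is split
fine = permitted(T1, T2, [("t11",), ("t12",), ("t21",), ("t22",)])  # every step is a segment

print(len(coarse), len(medium), len(fine))
print(coarse <= medium <= fine)
```

For these two-step transactions the three partitions admit 2, 3 and 6 schedules respectively, and each set contains the previous one. Whether a finer partition is still consistent and correct is a separate question that the enumeration does not answer.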




Given a set of scheduling rules, we are interested in knowing which one provides a higher degree of
concurrency. To this end, we model the intuitive concept "the degree of interleavings permitted by a
scheduling rule" by the formal concept of set containment. If the set of schedules that satisfies scheduling
rule $R_1$ is a subset of that which satisfies $R_2$, then we say that $R_2$ is more concurrent than $R_1$. We consider
that a consistent and correct modular scheduling rule is optimal under some conditions if this rule provides at
least as much concurrency as any other consistent and correct modular scheduling rule under the same
conditions. For example, serializability theory is optimal under the condition that only transaction syntax
information is used for scheduling [Kung 79].
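
The set-containment comparison can be written down directly. The sketch below assumes that each rule is represented simply by the finite set of schedules it permits; the rule names and the sample schedules are hypothetical and serve only to exercise the two definitions.

```python
from typing import Dict, FrozenSet, List, Tuple

Schedule = Tuple[str, ...]

def at_least_as_concurrent(r_a: FrozenSet[Schedule], r_b: FrozenSet[Schedule]) -> bool:
    """True if every schedule permitted by r_b is also permitted by r_a."""
    return r_b <= r_a

def optimal_rules(rules: Dict[str, FrozenSet[Schedule]]) -> List[str]:
    """Names of the rules whose schedule set contains that of every rule in the family."""
    return [name for name, permitted in rules.items()
            if all(other <= permitted for other in rules.values())]

serial_1 = ("t11", "t12", "t21", "t22")
serial_2 = ("t21", "t22", "t11", "t12")
interleaved_a = ("t11", "t21", "t22", "t12")
interleaved_b = ("t21", "t11", "t12", "t22")

rules = {
    "R1": frozenset({serial_1, serial_2}),
    "R2": frozenset({serial_1, serial_2, interleaved_a}),
}
print(at_least_as_concurrent(rules["R2"], rules["R1"]))
print(optimal_rules(rules))

# Set containment is only a partial order: two rules can be incomparable,
# in which case neither is optimal within the family.
incomparable = {
    "Ra": frozenset({serial_1, interleaved_a}),
    "Rb": frozenset({serial_1, interleaved_b}),
}
print(optimal_rules(incomparable))
```

The second family shows why optimality has to be proved rather than assumed: a family of rules need not contain one whose schedule set includes all the others.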

The optimality of scheduling rules is relatively easy to investigate when information used for scheduling is
not too rich, such as is the case when only transaction syntax information is used. We will also identify the
optimal scheduling rule when the consistency constraints of the database and transaction syntax information
are both used for scheduling. However, when the semantic information of transactions is also available for
scheduling, the identification of the optimal scheduling rule(s) becomes extremely difficult. In this case, we
settle for a less ambitious goal: completeness. We will develop a family of modular scheduling rules for users.
This family of rules is complete within the set of consistent and correct modular scheduling rules. This means
that given any consistent and correct scheduling rule $R$, a user can always find a rule $R'$ in this family such
that $R'$ provides at least as much concurrency as $R$.

1.2.3.2 The Setwise Serializable Scheduling Rule

Our second main contribution is the development of the optimal scheduling rule when both syntax
information and the database consistency constraints are utilized. This rule is called the setwise serializable
scheduling  rule.  To  understand  this  rule,  we  must  first  introduce  the  concepts  of  atomic  data  sets  and  the 
consistency  preserving  partition  of  the  database.  To  perform  a  consistency  preserving  partition  of  the  database, 
we  partition  data  objects  into  disjoint  sets  in  such  a  way  that  the  consistency  of  each  set  can  be  maintained 
independently  of  the  others.  Furthermore,  the  conjunction  of  the  consistency  constraints  of  each  of  the  sets  is 
equivalent to the database consistency constraints. When a database is partitioned in this way, each of the sets
in  the  partition  is  called  an  atomic  data  set  (ADS). 

Intuitively, an atomic data set is a collection of data objects with a set of consistency constraints that can be
satisfied independently of other atomic data sets. For example, in a distributed computer system a job queue
at a computer waiting for execution is an atomic data set, the local directory to a computer's files is an atomic
data set, and each user's mailbox is also an atomic data set. This is because the consistency of each of these
entities is defined by a set of consistency constraints that can be satisfied independently of others. As a more
detailed example, suppose that we have only two data objects: a document queue ($Q_d$) and a printer queue
($Q_p$) distributed in two different nodes in a local network. To bound the queue size, we specify that
"$0 \le Q_d \le Max_d$" and "$0 \le Q_p \le Max_p$", where $Max_d$ and $Max_p$ are the maximal sizes of $Q_d$ and $Q_p$
respectively. These two inequalities are now our database consistency constraints. To model the "producer-consumer"
relation between the document and printer queues, we specify the post-condition for the file transfer
transaction: the sum of these two queues is not altered by the file transfer operation, "$Q_d + Q_p = Q^*_d + Q^*_p$",
where $Q_i$ and $Q^*_i$, i = d, p, are the states input to and output from the file transfer transaction. In this
example, we have two atomic data sets $Q_d$ and $Q_p$ together with their consistency constraints
"$0 \le Q_d \le Max_d$" and "$0 \le Q_p \le Max_p$" respectively. It is important to note that inter-related data objects can be in
different atomic data sets, as long as the consistency of an atomic data set can be maintained independently of
other atomic data sets.
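
The distinction between the per-queue consistency constraints and the post-condition of the file transfer transaction can be made concrete with a short sketch. It assumes that the queue bounds are inclusive and that a queue is represented only by its length; the capacities `MAX_D` and `MAX_P` and all function names are hypothetical.

```python
MAX_D = 10   # hypothetical capacity of the document queue
MAX_P = 5    # hypothetical capacity of the printer queue

def doc_queue_consistent(q_d):
    """Consistency constraint of the document-queue atomic data set."""
    return 0 <= q_d <= MAX_D

def printer_queue_consistent(q_p):
    """Consistency constraint of the printer-queue atomic data set."""
    return 0 <= q_p <= MAX_P

def database_consistent(q_d, q_p):
    """The database constraint is the conjunction of the per-set constraints."""
    return doc_queue_consistent(q_d) and printer_queue_consistent(q_p)

def post_condition(before, after):
    """File transfer post-condition: the total number of queued items is unchanged."""
    return sum(before) == sum(after)

def transfer(q_d, q_p, n):
    """Move n items from the document queue to the printer queue."""
    new_d, new_p = q_d - n, q_p + n
    if not database_consistent(new_d, new_p):
        raise ValueError("transfer would violate a queue bound")
    return new_d, new_p

before = (4, 2)
after = transfer(*before, 2)
print(after, database_consistent(*after), post_condition(before, after))
```

Each bound mentions only one queue, so each can be checked without looking at the other queue. The relation between the two queues appears only in the post-condition, which is a property of the transaction rather than of the database.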

We are now in a position to define the setwise serializable scheduling rule. Recall that a modular
scheduling rule partitions the steps of a transaction into atomic step segments that must be executed serializably by
the  schedules  associated  with  the  rule.  Given  a  transaction,  the  setwise  serializable  scheduling  rule  partitions 
the  transaction  into  atomic  step  segments  in  the  form  of  transaction  ADS  segments.  A  transaction  ADS 
segment  is  defined  as  all  the  steps  in  a  transaction  that  access  data  objects  in  the  same  atomic  data  set.  For 
example,  the  file  transfer  transaction  above  has  two  sets  of  steps:  one  set  operates  upon  the  document  queue 
while the other set operates upon the printer queue. Since each queue is an atomic data set, each of the two
sets  of  steps  is  a  transaction  ADS  segment.  We  have  proved  that  the  setwise  serializable  scheduling  rule  is 
consistent,  correct  and  modular.  In  this  work,  a  modular  scheduling  rule  that  does  not  utilize  any  transaction 
semantic  information  is  referred  to  as  an  application  independent  rule.  We  have  also  proved  that  the  setwise 
serializable  scheduling  rule  is  the  optimal  application  independent  rule.  Owing  to  the  key  role  played  by  the 
consistency  preserving  partition,  we  will  present  a  more  detailed  example  to  illustrate  this  concept  in  Chapter 
7,  An  Example  of  Application. 
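
Forming transaction ADS segments amounts to grouping the steps of a transaction by the atomic data set they access. The sketch below assumes each step accesses exactly one data object and that the mapping from data objects to atomic data sets is given; the step labels and object names are hypothetical.

```python
def ads_segments(steps, ads_of):
    """Group the steps of one transaction by the atomic data set they access.

    steps  : list of (step_label, data_object) pairs in program order
    ads_of : mapping from data object to the name of its atomic data set
    """
    segments = {}
    for label, data_object in steps:
        segments.setdefault(ads_of[data_object], []).append(label)
    return segments

# Hypothetical steps of the file transfer transaction.
file_transfer = [
    ("read head of Qd", "Qd"),
    ("remove head of Qd", "Qd"),
    ("append to Qp", "Qp"),
]

# Consistency preserving partition: each queue is its own atomic data set.
partitioned = {"Qd": "document queue", "Qp": "printer queue"}
print(ads_segments(file_transfer, partitioned))

# With the whole database as a single atomic data set, the transaction is one
# segment, which is how serializability theory treats it.
unpartitioned = {"Qd": "database", "Qp": "database"}
print(ads_segments(file_transfer, unpartitioned))
```

With the partitioned database the transaction yields two segments, one per queue; with a single atomic data set it yields one segment containing all three steps.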




1.2.3.3 Compound Transactions

The third main contribution of this thesis is the development of a new transaction structure: compound
transactions and its associated scheduling rules called generalized setwise serializable scheduling rules. The
structure of compound transactions is designed to use the semantic information of a given transaction by
partitioning the steps of a transaction into a partially ordered set of elementary transactions.

Conceptually, the development of compound transactions is parallel to that of atomic data sets discussed in
the previous section. While atomic data sets result from a consistency preserving partition of the database,
elementary transactions result from a correctness preserving partition of a transaction. Given a database and
its associated consistency constraints, we partition the database into consistency preserving atomic data sets.
As long as the consistency of each atomic data set is satisfied individually, the consistency of the database is
also satisfied. Given a transaction and its associated post-conditions, we partition the transaction into
correctness preserving elementary transactions. As long as the post-condition of each elementary transaction is
individually satisfied, the post-conditions of the compound transaction are also satisfied. Each elementary
transaction must preserve the consistency of the database and satisfy its own post-condition when it is
executed serially and in an order consistent with the partial ordering of elementary transactions in the
compound transaction.

Given a compound transaction, the generalized setwise serializable scheduling rule partitions the steps of
each elementary transaction into transaction ADS segments. This means that as long as we execute all the
elementary transactions setwise serializably and in an order consistent with the partial orderings of elementary
transactions defined by compound transactions, the concurrent executions of compound transactions will be
consistent and correct.

Generally, an elementary transaction can have the structure of a nested transaction. In fact, a compound
transaction becomes a nested transaction when it consists of only one elementary transaction. Elementary
transactions in a compound transaction can exchange the results of their computations. When an elementary
transaction is executed serializably and in an order consistent with the partial ordering of the elementary
transactions in the compound transaction, it has the following two properties. First, given a consistent state of
the database and the results passed from preceding elementary transactions, an elementary transaction
produces another consistent state of the database and satisfies its own post-conditions. Second, the
conjunction of the post-conditions of the constituent elementary transactions is equivalent to the post-conditions of
the compound transactions.
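
The requirement that elementary transactions execute "in an order consistent with the partial ordering" can be checked with a few lines of code. The sketch below assumes the partial order is given as a set of (earlier, later) pairs; the compound transaction and its elementary transaction names are invented for illustration and do not appear in the report.

```python
def respects_partial_order(execution_order, precedes):
    """True if every (earlier, later) pair appears in that order in the execution."""
    position = {name: i for i, name in enumerate(execution_order)}
    return all(position[a] < position[b] for a, b in precedes)

# Hypothetical compound transaction with three elementary transactions:
# "reserve" must precede both "charge" and "notify", which are unordered
# with respect to each other.
precedes = {("reserve", "charge"), ("reserve", "notify")}

print(respects_partial_order(["reserve", "charge", "notify"], precedes))
print(respects_partial_order(["reserve", "notify", "charge"], precedes))
print(respects_partial_order(["charge", "reserve", "notify"], precedes))
```

The first two orders are acceptable because the partial order leaves "charge" and "notify" unordered; the third is rejected because "charge" runs before "reserve".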




Owing to these two properties, the locks acquired by an elementary transaction can be released at the end
of its execution. This ability can be used not only for the enhancement of concurrency but also for the
purpose of data object encapsulation and protection. We can take the abstract data type approach and let the
elementary transactions be the operations of an abstract data type. Owing to the property of elementary
transactions, locks on critical system data objects can be released rather than passed to users. This is
important because once the locks on sensitive system objects are passed to users, the system loses the control on
these objects until locks are returned to the system.

In this thesis, we prove that generalized setwise serializable scheduling rules form a complete class within
the set of all the consistent and correct modular scheduling rules. From an application point of view, this
means that the approach of partitioning the database and of writing and scheduling compound transactions
provides at least as much concurrency as any other modular concurrency control method.

1.2.3.4 The Failure Safe Rule

Our last main contribution concerns the failure safe rule used in failure recovery. In serializability theory,
the associated failure recovery rule is the concept of "failure atomicity", which requires all the executed steps
of a transaction to be either committed or aborted. In contrast, the failure safe rule partitions a compound
transaction into atomic commit segments as follows. Take a transaction ADS segment in each of the
elementary transactions and examine it to see if its steps are interleaved with another transaction ADS segment in the
same elementary transaction. If it is not interleaved with others, then it is an atomic commit segment. If it is
interleaved with others, then merge these interleaved transaction ADS segments into a single atomic commit
segment. At the end of this process, we have created a partially ordered set of atomic commit segments. In
the example compound transaction Get-A-and-B, each elementary transaction is an atomic commit segment,
because each elementary transaction consists of only one transaction ADS segment.
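
The merging procedure for one elementary transaction can be sketched as an interval merge. The sketch rests on an assumption about what "interleaved" means: two transaction ADS segments are treated as interleaved when the position range of one (first step to last step) overlaps that of the other, and merging is applied transitively. The function name and the sample steps are hypothetical.

```python
def atomic_commit_segments(steps):
    """Group the ADS segments of one elementary transaction into commit segments.

    steps : list of (step_label, ads_name) pairs in execution order
    Returns a list of groups; each group lists the atomic data sets whose
    transaction ADS segments are merged into one atomic commit segment.
    """
    # Record the first and last position at which each ADS is accessed.
    span = {}
    for i, (_, ads) in enumerate(steps):
        first, _ = span.get(ads, (i, i))
        span[ads] = (first, i)

    groups = []
    current_end = -1
    for ads, (first, last) in sorted(span.items(), key=lambda kv: kv[1][0]):
        if groups and first < current_end:
            # This segment starts before the current group ends: interleaved.
            groups[-1].append(ads)
            current_end = max(current_end, last)
        else:
            groups.append([ads])
            current_end = last
    return groups

# The segments on "a" and "b" are interleaved; the segment on "c" is not.
interleaved = [("x1", "a"), ("y1", "b"), ("x2", "a"), ("z1", "c")]
print(atomic_commit_segments(interleaved))

# No interleaving: each transaction ADS segment is its own commit segment.
separate = [("x1", "a"), ("x2", "a"), ("y1", "b")]
print(atomic_commit_segments(separate))
```

In the first case the segments on "a" and "b" become one atomic commit segment and "c" forms another; in the second case each atomic data set yields its own commit segment, as in the Get-A-and-B example.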

As long as each transaction commits its atomic commit segments in a sequence consistent with this partial
order, the consistency and correctness of concurrency control is ensured in the face of system failures. When
a failure occurs, transactions not affected by the failure continue executing consistently and correctly, while
transactions affected by the failure can later resume their computation consistently and correctly. For
example, the compound transaction Get-A-and-B can commit its atomic commit segments Get-A, Get-B,
Put-Back-A and Put-Back-B in an order consistent with the partial order of these segments. Suppose that
after Get-A has been committed, the compound transaction fails. The database is still consistent since Get-A
is consistency preserving. Therefore, other transactions can continue executing consistently and correctly.
When compound transaction Get-A-and-B resumes its execution, it can either execute Get-B to complete the
task or execute Put-Back-A if B is not available. In either case, the post-condition of Get-A-and-B is satisfied.
In this thesis, we prove that the failure safe rule is consistent, correct and modular. In addition, we also prove
that this rule is optimal in the set of all the possible modular recovery rules that can be designed for
compound transactions scheduled by generalized setwise serializable scheduling rules.

1.2.4 Summary

In this overview, we have outlined the characteristics of our approach and discussed four basic concepts
and four main contributions of our theory of modular concurrency control and failure recovery. The four
concepts are consistency, correctness, modularity and optimality. The four main contributions are a formal
model of modular scheduling rules, the setwise serializable scheduling rule, the compound transactions and
their associated generalized setwise serializable scheduling rules, and the failure safe rule. The formal model
of modular scheduling rules permits us to have a unified treatment of different modular scheduling methods.
The setwise serializable scheduling rule is the optimal application independent scheduling rule. The
generalized setwise serializable scheduling rules form a complete class within the set of all the modular scheduling
rules. This means that for any given modular scheduling rule, there always exists a generalized setwise
serializable scheduling rule that provides at least as much concurrency. The failure safe rule is the optimal
failure recovery rule for any given set of compound transactions scheduled by the generalized setwise
serializable scheduling rules. Finally, our theory includes nested transactions, serializability theory and failure
atomicity as special cases.

1.3 Organization of This Thesis

We begin our formal investigation in Chapter 2 by first developing a formal model of modular scheduling
rules. In this chapter, we first define the basic concepts related to the database and transactions. We then
introduce the concept of a modular scheduling rule. Finally, we define the important properties of modular
scheduling rules such as consistency, correctness and optimality. This model provides the foundation upon
which we develop our new transaction structures and modular scheduling rules. Having developed the model
of modular scheduling rules, we move on to the investigation of the implications of database consistency
constraints in Chapter 3. In this chapter, we develop two key concepts called a consistency preserving
partition of the database and atomic data sets. These two concepts and related theorems are the basis on
which we organize shared data objects in the system.




Having developed the model of modular scheduling rules and the concepts of atomic data sets and of the
consistency preserving partition of the database, we develop our new non-serializable scheduling rules in
Chapter 4. In this chapter, we introduce the optimal application independent scheduling rule called the
setwise serializable scheduling rule, which utilizes the properties of atomic data sets for the purpose of
concurrency control. We prove that this scheduling rule is consistent, correct and modular. Finally, we prove
that this rule is the optimal application independent scheduling rule. In Chapter 5, we remove the
requirement of application independence so that we can obtain a higher degree of concurrency than that which is
provided by the setwise serializable scheduling rule. In this chapter, we develop a new transaction structure
called compound transactions and their associated scheduling rules called generalized setwise serializable
scheduling rules. We prove that these rules are consistent, correct and modular. We conclude this chapter by
proving that these rules form a complete class within the set of all the possible consistent and correct modular
scheduling rules. This means that for any given consistent and correct modular scheduling rule, there is a
generalized setwise serializable scheduling rule which provides at least as much concurrency.

Having  studied  modular  scheduling  rules,  we  investigate  failure  recovery  rules  designed  for  our  proposed 
new transaction structures and scheduling rules in Chapter 6. In this chapter, we first define the concepts of
consistency, correctness, modularity and optimality for failure recovery rules. Next, we define a new recovery
rule called the failure safe rule, and we prove that this rule is consistent, correct and modular. We conclude
this  chapter  by  proving  that  this  rule  is  optimal  for  all  the  possible  modular  failure  recovery  rules  that  can  be 
designed  for  compound  transactions  scheduled  by  generalized  setwise  serializable  scheduling  rules.  Chapter 
7  is  an  example  of  a  distributed  directory  system,  which  is  used  to  illustrate  the  basic  concepts  of  this  theory. 
Chapter  8  presents  the  conclusion  of  this  thesis  and  discusses  the  future  work. 




2.  A  Model  of  Modular  Scheduling  Rules 


In this chapter, we begin our formal investigation by developing a model of modular scheduling rules. This
model is the foundation upon which we develop new transaction structures and scheduling rules in later
chapters. Many basic concepts in this chapter, such as "database" and "transactions", are similar to those
seen in the database literature, but the concepts related to modular scheduling rules are new. Before formally
defining all the concepts needed in this model, we give an overview of some of these new concepts.

Intuitively, a scheduling rule specifies how the steps of a transaction can be interleaved with those of other
transactions. In our model, a scheduling rule specifies the permissible ways of interleaving transaction steps
by partitioning the steps of each transaction into equivalence classes called atomic step segments. Schedules
satisfy the specification of a given scheduling rule by interleaving the atomic step segments of each
transaction serializably with the atomic step segments of other transactions. For example, serializability theory is a
particular scheduling rule which takes all the steps of a transaction as a single atomic step segment.
Serializable schedules are the set of schedules that satisfy this rule, because the atomic step segments specified by
serializability theory are interleaved serializably in serializable schedules.

We now define some important properties of scheduling rules. A scheduling rule is said to be modular if
this rule partitions each transaction independently of other transactions in the system. For example,
serializability theory "partitions" the steps of each transaction independently of the rest of the transactions in the
system. Thus, serializability theory is a modular scheduling rule. A scheduling rule is said to be consistent
and correct if and only if all the schedules satisfying this rule are consistent and correct. In order to discuss
the optimality of scheduling rules, we need a way to compare the concurrency provided by different
scheduling rules. Such a way is provided by set containment. For example, if the set of schedules satisfying a
scheduling rule $R_1$ is a subset of that satisfying another scheduling rule $R_2$, then rule $R_2$ is said to be more
concurrent than rule $R_1$. We consider a consistent and correct modular scheduling rule to be optimal under
some conditions if this rule is at least as concurrent as any other consistent and correct modular scheduling
rule under the same conditions. For example, serializability theory is optimal under the condition that only
syntax information of transactions is used for scheduling [Kung 79].

This chapter is organized as follows. In Section 2.1, we define the basic concepts related to the concept of
the database. In Section 2.2, we formalize the basic concepts regarding transactions and their schedules, and
in Section 2.3, we develop our model of modular scheduling rules.




2.1  Database 

A database is simply a collection of shared data objects, and a state of the database is a vector of the values of these
shared data objects. A consistent state is a state that satisfies a given set of consistency constraints. For
example, if we have a simple database consisting of two queues, $Q_1$ and $Q_2$, with consistency constraints
"$0 \le Q_1 \le 100$" and "$0 \le Q_2 \le 100$", then a state is a description of the length of $Q_1$ and the length of $Q_2$. A
consistent state is a state in which the lengths of both $Q_1$ and $Q_2$ are non-negative numbers less than or equal to
100. We now formalize the concepts related to the concept of database by first defining the concept of a data
object and its representation.

Definition 2.1-1: A data object, O, is a user defined smallest unit of data which is individually accessible
and upon which synchronization can be performed (e.g. locking).

Definition 2.1-2: Associated with each data object O, we have a set Dom(O), the domain of O, consisting of
all possible values taken by O.

Definition 2.2: Each data object is represented by triplets, <name, value, version number>. When a data
object is created, its initial value is assigned to version zero of this data object, e.g. "A[0] := 1". When the data
object is updated, a new version of the object is created, and the transaction works on this new version. The
version number will be incremented when the update is completed.

As an example, the step "A := A + 1" in a transaction corresponds to the following:

A[v+1] := A[v];  {where v denotes the current version}

A[v+1] := A[v+1] + 1;

v := v+1;
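The versioned representation of Definition 2.2 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name `DataObject`, the list-of-versions storage and the `update` helper are hypothetical and are not part of the model itself.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A data object kept as (name, value, version number) triplets."""
    name: str
    versions: list = field(default_factory=list)  # versions[v] is the value of version v

    @classmethod
    def create(cls, name, initial_value):
        # Creation assigns the initial value to version zero, e.g. A[0] := 1.
        return cls(name, [initial_value])

    @property
    def current_version(self):
        return len(self.versions) - 1

    @property
    def value(self):
        # Writing "O" is shorthand for "O[v]", the value of the current version.
        return self.versions[-1]

    def update(self, f):
        v = self.current_version
        working = self.versions[v]     # A[v+1] := A[v]
        working = f(working)           # A[v+1] := f(A[v+1])
        self.versions.append(working)  # v := v+1, once the update is complete
        return self.current_version


a = DataObject.create("A", 1)
a.update(lambda x: x + 1)  # the step "A := A + 1"
```

After the update, version 0 still holds the initial value and version 1 holds the incremented value, so earlier versions remain available when different versions of the same object must be referred to.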


In the following discussion, when we refer to the current value of a data object O, we would, for simplicity,
write "O" instead of "O[v]". The version number representation will be used when different versions of the
values of a data object are referred to. Having defined the concept of data objects, we are now in a position to
define the concept of the database.

Definition 2.3-1: The system database $D = \{O_1, \ldots, O_n\}$ is the collection of all the shared data objects
in the system.

Definition 2.3-2: A state of database D is an n-tuple $Y \in \Omega = \prod_{j=1}^{n} Dom(O_j)$.

Definition 2.3-3: Associated with database D, there is a set of consistency constraints in the form of
predicates on the states of database D. A consistent state of database D is an n-tuple, Y, satisfying this set of
consistency constraints. This is indicated by "C(Y) = 1", where C is a Boolean function indicating whether
this set of consistency constraints is satisfied by Y. For simplicity, we will also refer to this set of consistency
constraints by C. The meaning of C is easily determined by the context. The set of all consistent states of D is
denoted by U, where $U = \{Y \mid C(Y) = 1\}$.
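As a small illustration of Definitions 2.3-2 and 2.3-3, the following sketch enumerates the consistent states of the two-queue example. The domain bound of 200 and the names `DOMAINS`, `C` and `consistent_states` are our own assumptions, and the constraints are taken to be 0 ≤ length ≤ 100 for each queue.

```python
from itertools import product

# Hypothetical two-queue database: each state component is a queue length.
# Dom(O_j) is taken to be 0..200; the upper bound 200 is arbitrary.
DOMAINS = {"Q1": range(0, 201), "Q2": range(0, 201)}


def C(state):
    """Boolean function C: 1 if the state satisfies every consistency constraint."""
    q1, q2 = state
    return int(0 <= q1 <= 100 and 0 <= q2 <= 100)


def consistent_states(domains):
    """U = {Y | C(Y) = 1}, enumerated over the product of the domains."""
    return {y for y in product(*domains.values()) if C(y) == 1}


U = consistent_states(DOMAINS)
```

Here the state space is the Cartesian product of the two domains, and U is the subset on which C evaluates to 1.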

2.2  Transactions  and  Their  Schedules 

Having formalized the concept of the database, we now investigate the concepts of transactions and their
schedules. In its simplest form, a transaction is a sequence of transaction steps. Each step is a non-divisible
operation on a data object of the database. For example, the transaction EnqueueOp(item) = {Writelock $Q_1$;
if $Q_1 < 100$ then put item at the tail of $Q_1$ by updating the appropriate pointers; Unlock $Q_1$.} is a transaction
with a single write step. The lock and unlock pair of operations is a mechanism that makes the complex
operations of enqueue appear to be an indivisible step on data object $Q_1$.

There are different types of transactions. When the steps of a transaction are in the form of a sequence, this
transaction is called a single level transaction. When the steps of a transaction are organized into a hierarchical
form of nested sub-transactions, this transaction is called a nested transaction. Obviously, a single level
transaction is a special case of a nested transaction. A nested transaction is, however, a special case of a
compound transaction, which was informally introduced in the overview and which will be investigated in
detail in Chapter 5. In this chapter, we focus on single level transactions, because they are the simplest to
study first.


A transaction is said to be consistent and correct if it preserves the consistency of the database and satisfies
its own post-conditions when executing alone. In this thesis, we study only consistent and correct transactions.
The concurrent executions of a given set of transactions are modelled by a schedule, which is a total ordering
of all the transaction steps of a given set of transactions. In addition, the steps in a schedule must be
consistent with the orderings of steps in each of the transactions. For example, suppose that we have two
transactions $T_1 = \{t_{1,1}, t_{1,2}\}$ and $T_2 = \{t_{2,1}, t_{2,2}\}$. The sequence of steps $\langle t_{1,1}, t_{2,1}, t_{1,2}, t_{2,2} \rangle$ is a schedule, but
the sequence $\langle t_{1,2}, t_{1,1}, t_{2,1}, t_{2,2} \rangle$ is not a schedule.
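The condition that a schedule must respect the step ordering of every transaction can be checked mechanically. The sketch below is illustrative only: the function name `is_schedule`, the encoding of steps as (transaction, index) pairs and the two sample sequences are our own assumptions.

```python
def is_schedule(sequence, transactions):
    """Return True if `sequence` is a total ordering of all steps of all
    transactions that preserves the step order inside each transaction."""
    all_steps = [s for steps in transactions.values() for s in steps]
    if sorted(sequence) != sorted(all_steps):
        return False  # a step is missing, duplicated, or foreign
    position = {step: i for i, step in enumerate(sequence)}
    return all(
        position[a] < position[b]
        for steps in transactions.values()
        for a, b in zip(steps, steps[1:])
    )


# Steps are (transaction, index) pairs; T1 and T2 each have two steps.
T = {1: [(1, 1), (1, 2)], 2: [(2, 1), (2, 2)]}
interleaved = [(1, 1), (2, 1), (1, 2), (2, 2)]  # keeps both internal orders
reordered = [(1, 2), (1, 1), (2, 1), (2, 2)]    # runs step 2 of T1 before step 1
```

The first sequence interleaves the two transactions while keeping each one's internal order, so it is a schedule; the second places the second step of T1 before its first step, so it is not.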




Assuming that each transaction is consistent and correct when executing alone, a schedule is said to be
consistent and correct if under this schedule each of the transactions is executed consistently and correctly. In
the study of schedules, another important concept is called the equivalence of schedules. Two schedules are
said to be equivalent if and only if they induce identical ordering of steps on each of the shared data objects.
For example, let schedule $z_1 = \langle t_{1,1}(O_1), t_{2,1}(O_1), t_{1,2}(O_2), t_{2,2}(O_2) \rangle$ and schedule
$z_2 = \langle t_{1,1}(O_1), t_{1,2}(O_2), t_{2,1}(O_1), t_{2,2}(O_2) \rangle$. Schedules $z_1$ and $z_2$ are equivalent, because in both schedules the
operations on data objects $O_1$ and $O_2$ can be represented as "$t_{2,1}(t_{1,1}(O_1))$" and "$t_{2,2}(t_{1,2}(O_2))$".

It is intuitively clear that if two schedules are equivalent and one is consistent and correct, then the other
must be so. This intuition is formally proved as Theorem 2.1. As an example of equivalent schedules, we
know that serial schedules are consistent and correct. By Theorem 2.1, all the schedules equivalent to serial
schedules are also consistent and correct. The union of the set of serial schedules and its equivalent set of
schedules is known as serializable schedules.

Having reviewed the concepts of transactions and their schedules, we now formalize the concept of
transactions in Section 2.2.1 and the concept of schedules in Section 2.2.2.

2.2.1  Transactions 

In this section, we first define the syntax of single level transactions and the concepts of pre- and
post-conditions of a transaction. We then enumerate our fundamental assumptions regarding transactions. In
later  chapters,  we  will  introduce  generalizations  to  the  single  level  transactions  defined  here.  In  the  following 
discussion,  the  word  "transaction”  means  "single  level  transaction"  unless  it  is  explicitly  stated  otherwise,  such 
as  "nested  transaction"  and  "compound  transaction". 

Definition 2.4: A single level transaction $T_i$ is a sequence of transaction steps $(t_{i,1}, t_{i,2}, \ldots, t_{i,m_i})$ with the
following syntax:

<SingleLevelTransaction> ::= BeginSerial <StepList> EndSerial.

<StepList> ::= Step | <StepList>;<StepList>

A transaction step is modelled as the non-divisible execution of the following instructions [Kung 79]:

$$L_{i,j} := O_{i,j}$$

$$O_{i,j} := f_{i,j}(L_{i,1}, L_{i,2}, \ldots, L_{i,j})$$

where the symbol "$t_{i,j}$" represents step j of transaction $T_i$; the local variable $L_{i,j}$ is used by step $t_{i,j}$ to store
the value read. The symbol "$O_{i,j}$" is the data object accessed (read or written) by step $t_{i,j}$, and the symbol
"$f_{i,j}$" represents the computation performed by step $t_{i,j}$.

In this model, every step reads and then writes a data object. A read step is interpreted as writing the value
read back to the data object. That is, the function $f_{i,j}$ associated with a read step is the identity function. It is
well known that concurrency can be improved by using a richer set of primitive steps in addition to the "read"
and "write" steps. We will address this point when we discuss the issue of optimality in Section 2.3.3.
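The read-then-write step model can be illustrated with a short sketch. The function names `run_step` and `identity_of_last`, the dictionary encoding of the database and of the local variables, and the sample values are our own assumptions; the sketch also assumes that steps 1 to j-1 of the same transaction have already been executed.

```python
def run_step(db, local, i, j, obj, f):
    """Execute step t_{i,j} indivisibly: read obj into L_{i,j}, then write
    f(L_{i,1}, ..., L_{i,j}) back to obj."""
    local[(i, j)] = db[obj]                          # L_{i,j} := O_{i,j}
    args = [local[(i, k)] for k in range(1, j + 1)]  # L_{i,1}, ..., L_{i,j}
    db[obj] = f(*args)                               # O_{i,j} := f_{i,j}(...)


def identity_of_last(*locals_so_far):
    # A read step writes back the value it has just read.
    return locals_so_far[-1]


db = {"A": 5, "B": 7}
local = {}
run_step(db, local, 1, 1, "A", identity_of_last)        # read A
run_step(db, local, 1, 2, "B", lambda la, lb: la + lb)  # B := A + B
```

The first step leaves A unchanged because its function is the identity; the second step computes its new value from the local variables of both steps.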


Having defined the syntax, we need to define the pre- and post-conditions of a transaction. We begin with
defining the input steps and output steps in a transaction.

Definition 2.5: Let $T_i = \{t_{i,1}, \ldots, t_{i,m_i}\}$ be a transaction. Let data object O be the one accessed (read or
written) by step $t_{i,j}$. Step $t_{i,j}$ is said to be an input step if it is the step in $T_i$ that first accesses data object O. Step
$t_{i,j}$ is said to be an output step if it is the step in $T_i$ last accessing O. That is, for every data object O accessed by
$T_i$, there is an input step and an output step associated with O. Note that when there is only one step in $T_i$
accessing O, then this step is both an input and an output step.
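Input and output steps can be found in a single pass over a transaction. In the sketch below, the function name `input_output_steps` and the four-step sample transaction are hypothetical; steps are identified by their 0-based position.

```python
def input_output_steps(accessed):
    """`accessed[j]` is the name of the data object accessed by step j
    (0-based). Returns, per data object, the index of its input step
    (first access) and of its output step (last access)."""
    first, last = {}, {}
    for j, obj in enumerate(accessed):
        first.setdefault(obj, j)  # keep only the first access
        last[obj] = j             # overwritten until the last access
    return first, last


# Hypothetical transaction whose steps access A, B, A, C in that order.
inputs, outputs = input_output_steps(["A", "B", "A", "C"])
```

For data objects B and C, which are accessed only once, the same step is both the input step and the output step.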


Since transactions operate on the shared database, they must be able to accept any consistent state as their
input. That is, the values input to a transaction are assumed to satisfy the pre-conditions of the transaction as
long as these values come from a consistent database state.


Definition 2.6: Let $O_{i,j}[v_I]$, j = 1 to $k_i$, be the set of values read by the input steps of $T_i$, where $v_I$ denotes
the version of a data object that is input to a transaction. Let the index set of $O_{i,j}[v_I]$, j = 1 to $k_i$, be $I_i$. The
input values to $T_i$, $O_{i,j}[v_I]$, j = 1 to $k_i$, are said to satisfy the pre-condition of $T_i$ if and only if

$$\exists (X \in U)\,(\pi_{I_i}(X) = (O_{i,j}[v_I],\ j = 1 \text{ to } k_i))$$

where $\pi_{I_i}$ is the projection operator. That is, $\pi_{I_i}(X)$ is a tuple whose elements are those of X indexed by $I_i$.

In other words, Definition 2.6 states that a transaction must function properly if all the values input to the
transaction come from a consistent state of the database. Having discussed the pre-conditions, we now
turn to the subject of post-conditions.


Definition 2.7: Let $O_{i,j}[v_F]$, j = 1 to $k_i$, be the set of values written by the output steps of transaction $T_i$,
where $v_F$ denotes the version of a data object output by the transaction. The post-condition of transaction $T_i$ is
the specification of the output values of $T_i$ as functions of the input values:

$$O_{i,j}[v_F] = g_{i,j}(O_{i,1}[v_I], \ldots, O_{i,k_i}[v_I]),\quad j = 1 \text{ to } k_i.$$


Having  defined  concepts  relating  to  the  notion  of  a  transaction,  we  now  state  our  fundamental  assumptions 
about  a  transaction: 


Fundamental Assumptions:

•  A1 Termination: A transaction is assumed to have a finite number of steps and assumed to
terminate.

•  A2 Transaction Correctness: A transaction is assumed to produce results that satisfy its
post-condition when executing alone and when the database is initially consistent.

•  A3 Transaction Consistency: Given a consistent state of the database, a transaction is assumed to
produce a state that satisfies the consistency constraints of the database when it is executing alone.


Definition 2.8: A transaction $T_i$ is said to be consistent and correct if and only if it satisfies assumptions
A1, A2 and A3.


In  the  following  discussion,  we  restrict  our  attention  to  consistent  and  correct  transactions. 


2.2.2  Schedules 

In  the  previous  two  sections,  we  have  formalized  the  basic  concepts  related  to  database  and  transactions. 
We now develop our model one step further by considering the execution of a set of transactions. To this
end,  we  introduce  the  concepts  of  transaction  systems  and  schedules. 

Definition 2.9: A transaction system T is a finite set of transactions $\{T_1, \ldots, T_n\}$ operating upon the shared
database D.

Definition 2.10: A schedule z for transaction system T is a totally ordered set of all the steps in the
transaction system $T = \{T_1, \ldots, T_n\}$ such that the ordering of steps of $T_i$, i = 1 to n, in the schedule is
consistent with the ordering of steps in the transaction $T_i$, i = 1 to n.

$$[\forall t\,((t \in z) \Leftrightarrow (t \in \cup T))] \wedge [\forall (T_i \in T)\, \forall ((t_{i,j}, t_{i,k} \in T_i) \wedge (t_{i,j} > t_{i,k}))\, ((t_{i,j}, t_{i,k} \in z) \wedge (t_{i,j} > t_{i,k}))]$$

A schedule z for transaction system T is said to be consistent if and only if the execution of T according to z
preserves the consistency of the database D. This concept is formalized as follows.

Definition 2.11: Let X be the initial state of D and Y be the state at the end of executing T according to z.
Schedule z is said to be consistent if and only if

$$z:\ X \in U \rightarrow Y \in U$$


A schedule z for a transaction system T is said to be correct if and only if the execution of transactions in T
according to z produces computations that satisfy their respective transaction post-conditions.

Definition 2.12: Under schedule z, let the values input to and output from transaction $T_i \in T$ be
$O_{i,1}[v_I], \ldots, O_{i,k_i}[v_I]$ and $O_{i,1}[v_F], \ldots, O_{i,k_i}[v_F]$ respectively. Schedule z is said to be correct if and only if

$$\forall (T_i \in T)\,(O_{i,j}[v_F] = g_{i,j}(O_{i,1}[v_I], \ldots, O_{i,k_i}[v_I]),\ j = 1 \text{ to } k_i)$$

Having defined the basic properties of a schedule, we now consider the relations between schedules by
introducing the concept of equivalent schedules. Conceptually, two schedules $z$ and $z^*$ for transaction system
T are equivalent if for any given initial state of D the executions of T according to $z$ and $z^*$ yield the same
sequences of values for each data object in the database and the same sequences of values for each of the local
variables (states) of each transaction in T. This is formally defined by the partial ordering of steps induced by
$z$ and $z^*$ on each of the data objects in D.

Definition 2.13: Let $O_{t_{i,j}} = O_{t_{k,m}}$ denote that step $t_{i,j}$ and step $t_{k,m}$ read or write the same data object. A
schedule $z$ for transaction system T is said to be equivalent to another schedule $z^*$ for T, if for every pair of
steps $t_{i,j}$ and $t_{k,m}$ in $z$ and $z^*$,

$$\forall ((t_{i,j}, t_{k,m} \in \cup T) \wedge (O_{t_{i,j}} = O_{t_{k,m}}))\, [((t_{i,j}, t_{k,m} \in z) \wedge (t_{i,j} > t_{k,m})) \Leftrightarrow ((t_{i,j}, t_{k,m} \in z^*) \wedge (t_{i,j} > t_{k,m}))]$$

That is, for each of the data objects in the database, O, the orderings of transaction steps induced by $z$ and
by $z^*$, $t_{k,m}( \ldots (t_{i,j}(O)))$, are identical.

We now prove an important theorem which states that if $z$ and $z^*$ for T are equivalent and if $z$ is consistent
and correct, then $z^*$ is also consistent and correct. We begin our proof with the following lemma.

Lemma 2.1: Let $z$ and $z^*$ be two equivalent schedules for transaction system T. Let t be a step of transaction $T_i \in T$.
Let the values of the data object accessed by t in $z$ and $z^*$ be $O_t$ and $O_t^*$ respectively. Let the values of the local
variable associated with t in $z$ and $z^*$ be $L_t$ and $L_t^*$ respectively. Given the identical initial state X to both $z$ and
$z^*$, we have

$$\forall (T_i \in T)\, \forall (t \in T_i)\, ((O_t = O_t^*) \wedge (L_t = L_t^*))$$

That is, the values input to and output from any transaction step under $z$ are equal to those under $z^*$.


Proof: Recall that the syntax of a transaction step is as follows.

$$L_{i,j} := O_{i,j}$$

$$O_{i,j} := f_{i,j}(L_{i,1}, L_{i,2}, \ldots, L_{i,j})$$

It follows that a transaction step will output identical values under $z$ and $z^*$ if for any transaction step
$t \in \cup T$ the values input to t under $z$ and under $z^*$ are equal. Thus, we need to show only that each transaction
step $t \in \cup T$ inputs the same values under both schedules $z$ and $z^*$.


Now consider the first step in schedule $z$, denoted as $t_{z,1}$. Step $t_{z,1}$ must input the initial value of some data
object in D, denoted as $O_{z,1}$. By the definition of equivalent schedules and with the same initial state of D,
step $t_{z,1}$ must input the same initial value of data object $O_{z,1}$ under $z^*$. Hence, $t_{z,1}$ outputs the same value under both $z$
and $z^*$. Now consider the second step $t_{z,2}$. The value input to the local variable of $t_{z,2}$ under $z$ is either
the initial value of a data object or the value output by step $t_{z,1}$. By the definition of equivalent schedules and
by the fact that $t_{z,1}$ outputs the same value under both $z$ and $z^*$, step $t_{z,2}$ will input the same value under both
$z$ and $z^*$. Suppose that step $t_{z,k}$ inputs the same value under schedules $z$ and $z^*$. Following the argument
above, step $t_{z,k+1}$ inputs the same value under $z$ and $z^*$. By induction, this lemma is true. □

Theorem 2.1: Let schedule $z$ be a consistent and correct schedule for transaction system T. If schedule
$z^* \equiv z$, then $z^*$ is also consistent and correct.

Proof: To show that $z^*$ is correct, we need to show that for any $T_i \in T$, the execution of $T_i$ under $z^*$
produces correct results. Let the values input to $T_i$ be $O_{i,1}[v_I], \ldots, O_{i,k_i}[v_I]$. Let the values output by $T_i$ be
$O_{i,1}[v_F], \ldots, O_{i,k_i}[v_F]$. The post-condition of $T_i$ is the specification of the output values of $T_i$ as some functions
of the input values: $O_{i,j}[v_F] = g_{i,j}(O_{i,1}[v_I], \ldots, O_{i,k_i}[v_I])$, j = 1 to $k_i$. It follows from Lemma 2.1 that all the
input values to and the output values from $T_i$ under $z$ and $z^*$ are identical. Therefore, the post-condition of $T_i$
must be satisfied under both $z$ and $z^*$.

To show that $z^*$ is consistent, let the initial state of D be $X \in U$ and the final states of D that result from
executing T according to $z$ and $z^*$ be $Y$ and $Y^*$ respectively. It follows from the definition of equivalent schedules
and Lemma 2.1 that $Y = Y^*$. Thus, $Y^*$ is a consistent state. □

As an example of equivalent schedules, a serializable schedule is defined as a schedule $z$ for which there
exists a serial schedule $z^*$ such that $z \equiv z^*$. The consistency and correctness of serial schedules directly follow
from the assumption that each transaction is consistent and correct when executing alone. By Theorem 2.1,
serializable schedules are also consistent and correct.
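For small transaction systems, serializability in this sense can be tested by brute force: try every serial order of the transactions and compare the induced per-object orderings. The sketch below is illustrative; the function names, the encoding of steps as (step, data object) pairs and the sample schedules are our own assumptions, and the search is exponential in the number of transactions.

```python
from itertools import permutations


def object_orders(schedule):
    """Per data object, the sequence of steps accessing it."""
    orders = {}
    for step, obj in schedule:
        orders.setdefault(obj, []).append(step)
    return orders


def is_serializable(z, transactions):
    """True if z is equivalent to some serial schedule, i.e. one that
    runs the transactions one after another in some order."""
    target = object_orders(z)
    for order in permutations(transactions):
        serial = [s for name in order for s in transactions[name]]
        if object_orders(serial) == target:
            return True
    return False


T = {
    "T1": [("t11", "O1"), ("t12", "O2")],
    "T2": [("t21", "O1"), ("t22", "O2")],
}
ok = [("t11", "O1"), ("t21", "O1"), ("t12", "O2"), ("t22", "O2")]
bad = [("t11", "O1"), ("t21", "O1"), ("t22", "O2"), ("t12", "O2")]
```

Schedule `ok` orders both objects as the serial schedule T1;T2 does, so it is serializable. Schedule `bad` orders O1 as T1;T2 but O2 as T2;T1, so no serial schedule is equivalent to it.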

2.3  Modular  Scheduling  Rules 

So  far  our  models  of  database,  transactions  and  schedules  are  similar  to  those  in  the  database  literature.  We 
now  extend  our  models  to  address  the  concepts  related  to  modular  scheduling  rules.  Intuitively,  a  scheduling 
rule  is  a  specification  of  the  permissible  interleavings  of  transaction  steps  in  a  schedule.  For  example, 
serializability theory is a scheduling rule which specifies that in a schedule the steps of a transaction must be
interleaved serializably with the steps of other transactions. A scheduling rule specifies the interleavings of
transaction steps by partitioning the steps of each transaction into equivalence classes called atomic step
segments. The atomic step segments of one transaction must be interleaved serializably with those of other
transactions in schedules that satisfy this rule. For example, in a serializable schedule all the steps of a
transaction must be interleaved serializably with those of other transactions.




We now define some important properties of scheduling rules. A scheduling rule is said to be consistent
and correct if all the schedules that satisfy this rule are consistent and correct. A scheduling rule is said to be
modular if it partitions each transaction into atomic step segments independently of other transactions in the
system. A modular scheduling rule is said to be application independent if it performs the partition without
using the semantic information of transactions. For example, serializability theory is a modular scheduling
rule, because it "partitions" each transaction into a single atomic step segment independently of other
transactions. It is also an application independent rule, because it performs the "partition" without using any
semantic information of transactions.

Given a set of scheduling rules, we are interested in knowing which one provides a higher degree of
concurrency. To this end, we model the concurrency of scheduling rules by set containment. If the set of
schedules that satisfies scheduling rule $R_1$ is a subset of that which satisfies $R_2$, then we say that $R_2$ is more
concurrent than $R_1$. We consider that a consistent and correct modular scheduling rule is optimal under some
conditions if this rule is at least as concurrent as any other consistent and correct modular scheduling rule
under the same conditions. For example, serializability theory is optimal under the condition that only
transaction syntax information is used for scheduling [Kung 79].

The optimality of scheduling rules is relatively easy to investigate when information used for scheduling is
not too rich, such as is the case when only transaction syntax information is used. We will also identify the
optimal scheduling rule when the consistency constraints of the database and transaction syntax information
are both used for scheduling. However, when the semantic information of transactions is also available for
scheduling, the identification of the optimal scheduling rule(s) becomes extremely difficult. In this case, we
settle for a less ambitious goal: completeness. This means that we will develop a family of modular scheduling
rules for users. This family of rules is complete within the set of consistent and correct modular scheduling
rules. This means that given any consistent and correct scheduling rule $R^*$, a user can always find a rule $R$ in
this family such that $R$ is at least as concurrent as $R^*$.

Having reviewed the concepts related to scheduling rules, we now formally define the concept of
scheduling rules in Section 2.3.1, the concept of consistency and correctness of scheduling rules in Section 2.3.2, the
concept of modularity and application independence in Section 2.3.3 and finally the concept of optimality
and completeness in Section 2.3.4.




2.3.1  Definition  of  Scheduling  Rules 

Having discussed the concepts related to modular scheduling rules, we now formalize the concept of a
scheduling rule and its relation to schedules.

Let $\mathcal{T}_m$ denote the set of all the possible consistent and correct transactions with m steps. Let $\mathcal{T}$ denote the
set of all the possible consistent and correct transactions, that is, $\mathcal{T} = \bigcup_{m=1}^{\infty} \mathcal{T}_m$. Let $P_m$ denote a partition
into atomic step segments of an m-step consistent and correct transaction. Let $\mathcal{P}_m$ denote the set of all the
possible partitions of a consistent and correct m-step transaction. Let $\mathcal{P}$ be the set of all the possible partitions,
that is, $\mathcal{P} = \bigcup_{m=1}^{\infty} \mathcal{P}_m$.

Definition 2.14-1: A scheduling rule for a transaction system with n transactions, $R_n$, is a function which
takes the transaction system of size n and partitions each of the n transactions.

Definition 2.14-2: A scheduling rule R is a function which takes a transaction system of any size and
partitions each of the transactions in the system,
such that the restriction of R to $\prod_{i=1}^{n} \mathcal{T}$ is $R_n$, that is, $R|_{\prod_{i=1}^{n} \mathcal{T}} = R_n$, n = 1 to ∞.

Given  a  scheduling  rule  R,  we  must  identify  the  set  of  schedules  that  satisfy  R.  A  schedule  z  satisfies  R  if 
atomic step segments of one transaction are interleaved serializably with those of others in z. This is
formalized as follows.

Definition 2.15: Let $T = \{T_1, \ldots, T_n\}$ be a consistent and correct transaction system, that is, $T \subset \mathcal{T}$. Let $T_i$
be a transaction in T. Let $r_R(T_i)$ denote the partition of the steps of $T_i$ by R. Let F be the set of all the atomic
step segments of T specified by R, that is, $F = \bigcup_{T_i \in T} r_R(T_i)$. Let D be the database and Z(T) be the set of
all the possible schedules for T. Finally, let $\sigma_i$ be an atomic step segment of $T_i$, that is, $\sigma_i \in r_R(T_i)$. A
schedule $z \in Z(T)$ is said to be atomic step segment serial with respect to R if atomic step segments specified by
R and belonging to different transactions do not overlap in z. That is,

$$\forall (T_i, T_j \in T,\ i \neq j)\, \forall (\sigma_i \in r_R(T_i))\, \forall (\sigma_j \in r_R(T_j))\, \forall (t \in \sigma_i)\, ((t < t_j^f) \vee (t > t_j^l))$$

where $t_j^f$ and $t_j^l$ are the first and last steps in $\sigma_j$ respectively.

Definition 2.16: A schedule $z \in Z(T)$ is said to satisfy scheduling rule R, if atomic step segments specified by
R and belonging to different transactions are interleaved serializably in z. That is, z satisfies R if and only if

$$\exists (z^* \in Z(T))\, ((z \equiv z^*) \wedge (z^* \text{ is atomic step segment serial}))$$

The set of all the schedules in Z(T) that satisfies R is denoted by $Z_R(T)$. That is,

$$Z_R(T) = \{z \in Z(T) \mid z \text{ satisfies } R\}$$
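To make Definitions 2.15 and 2.16 concrete, the sketch below enumerates Z(T) for a two-transaction system and computes $Z_R(T)$ for two partitions: one that keeps each transaction as a single atomic step segment (serializability theory) and one that makes every step its own segment. All names, the dictionary encoding of partitions and the sample system are our own assumptions; the finer partition is used only to illustrate set containment and is not claimed to be a consistent and correct rule.

```python
from itertools import permutations


def all_schedules(transactions):
    """Z(T): every interleaving that keeps each transaction's step order."""
    steps = [s for t in transactions.values() for s in t]

    def ordered(z):
        return all(z.index(a) < z.index(b)
                   for t in transactions.values() for a, b in zip(t, t[1:]))

    return [list(z) for z in permutations(steps) if ordered(z)]


def segment_serial(z, partition):
    """No step lies strictly inside a segment of another transaction."""
    pos = {s: i for i, s in enumerate(z)}
    for ti, segs_i in partition.items():
        for tj, segs_j in partition.items():
            if ti == tj:
                continue
            for seg_j in segs_j:
                first, last = pos[seg_j[0]], pos[seg_j[-1]]
                if any(first < pos[t] < last for seg in segs_i for t in seg):
                    return False
    return True


def orders(z, obj_of):
    """Per data object, the sequence of steps accessing it in z."""
    out = {}
    for s in z:
        out.setdefault(obj_of[s], []).append(s)
    return out


def satisfying(transactions, partition, obj_of):
    """Z_R(T): schedules equivalent to some atomic step segment serial one."""
    zs = all_schedules(transactions)
    serial = [orders(z, obj_of) for z in zs if segment_serial(z, partition)]
    return [z for z in zs if orders(z, obj_of) in serial]


T = {"T1": ["t11", "t12"], "T2": ["t21", "t22"]}
obj_of = {"t11": "O1", "t12": "O2", "t21": "O1", "t22": "O2"}
whole = {"T1": [["t11", "t12"]], "T2": [["t21", "t22"]]}       # one segment each
single = {"T1": [["t11"], ["t12"]], "T2": [["t21"], ["t22"]]}  # one step each

Z_whole = satisfying(T, whole, obj_of)
Z_single = satisfying(T, single, obj_of)
```

In this small system Z(T) has six schedules. Four of them satisfy the whole-transaction partition, while all six satisfy the single-step partition, so the first set is contained in the second; in the set containment sense, the single-step partition is the more concurrent of the two.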

2.3.2  Consistency  and  Correctness 

In the previous section, we defined the concept of a scheduling rule and its relation to schedules. We
now define the key properties of a scheduling rule: consistency and correctness.

Definition 2.17: A scheduling rule R is said to be consistent and correct if and only if all the schedules that
satisfy R are consistent and correct,

$$\forall (T \subseteq \mathcal{T})\ \forall (z \in Z_R(T))\ (z \text{ is consistent and correct})$$

where $\mathcal{T}$ is the set of all the consistent and correct transactions.

In  the  following,  we  limit  our  discussions  to  consistent  and  correct  scheduling  rules. 

2.3.3  Modularity  and  Application  Independence 

We  now  define  two  important  properties  of  a  scheduling  rule,  in  addition  to  the  concept  of  consistency  and 
correctness. These two properties are modularity and application independence. We consider a scheduling
rule  to  be  modular,  if  it  partitions  each  transaction  in  a  way  that  is  independent  of  all  the  other  transactions  in 
the  system. 

Definition 2.18: A scheduling rule R is said to be modular if and only if R partitions each transaction
independently. That is,




$$R(T_1, \ldots, T_n) = (R_1^n(T_1), \ldots, R_n^n(T_n)), \quad n = 1 \text{ to } \infty.$$

The scheduling rules for individual transactions, $R_i^n$, i = 1 to n, will be referred to as the kernels of the
modular scheduling rule R.

Note that the kernels of a modular scheduling rule need not be identical for different transactions. In other
words,  a  modular  scheduling  rule  can  consist  of  a  family  of  kernels.  This  allows  each  of  these  kernels  to  take 
advantage  of  the  semantic  information  of  the  given  transaction. 

We  now  turn  to  the  concept  of  an  application  independent  scheduling  rule.  Thus  far  we  have  assumed  that 
each programmer writes and schedules his own transaction. The scheduling is done with full knowledge of the
transaction written but without specific knowledge of others' transactions.

To  further  simplify  the  scheduling  task,  we  would  like  to  develop  scheduling  rules  that  can  be  implemented 
by a simple procedure that is independent of the computations carried out by transactions. To this end, an
application independent scheduling rule must ignore the specific steps of various transactions. In other
words, an application independent scheduling rule views a transaction as a sequence of read and write steps
but ignores any detailed aspects of the computation carried out by the transaction. We formalize our
discussion on application independent scheduling rules as follows.

Definition 2.19-1: Two transactions $T_i$ and $T_j$ are said to be syntactically identical if and only if

1. Transactions $T_i$ and $T_j$ have the same number of steps.

2. If step k of transaction $T_i$ reads (writes) data object O, then step k of transaction $T_j$ reads (writes)
the same data object O.

Definition 2.19-2: Two consistent and correct transactions $T_i$ and $T_j$ are said to be equivalent in syntax if
and only if $T_i$ and $T_j$ both satisfy a given syntactic definition.

Definition 2.20: A modular scheduling rule R is said to be application independent if the kernels of R are
the same for all the transactions that are equivalent in syntax. Furthermore, each kernel must produce
identical partitions for transactions that are identical in syntax. That is, let $\{T_i,\ 1 \le i \le n\}$ be a set of
transactions with equivalent syntax. We have




1. $R(T_1, \ldots, T_n) = (R_1(T_1), \ldots, R_1(T_n))$, n = 1 to $\infty$.

2. $R_1(T_i) = R_1(T_j)$, whenever transactions $T_i$ and $T_j$ are syntactically identical.

Conceptually, an application independent rule is a procedure (function) that can be applied to all the
transactions defined by a given syntax. Furthermore, this procedure views a transaction simply as a sequence
of read and write steps. For example, the serializable scheduling rule is a modular rule with the same
scheduling rule kernel for all the transactions, which takes the entire set of steps of a transaction as a single
atomic step segment. This procedure certainly produces identical partitions for syntactically identical
transactions. Thus, this rule is not only modular but also application independent.
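
To make the last example concrete, the sketch below encodes a transaction purely as its sequence of read and write steps, tests syntactic identity, and applies a kernel that returns all steps as one atomic step segment. It is an illustration under our own assumptions, not the report's implementation: the representation and the names `syntactically_identical`, `serializable_kernel`, `debit`, `credit` and `audit` are hypothetical.

```python
from typing import List, Tuple

Step = Tuple[str, str]          # (operation, data object); operation is "r" or "w"
Transaction = List[Step]


def syntactically_identical(t1: Transaction, t2: Transaction) -> bool:
    """Same number of steps, and step k reads (writes) the same object in both."""
    return len(t1) == len(t2) and all(a == b for a, b in zip(t1, t2))


def serializable_kernel(t: Transaction) -> List[List[int]]:
    """Kernel of the serializable rule: all steps form one atomic step segment."""
    return [list(range(len(t)))]


# Hypothetical transactions. debit and credit compute different things but are
# indistinguishable to a rule that only sees read and write steps.
debit = [("r", "balance"), ("w", "balance")]
credit = [("r", "balance"), ("w", "balance")]
audit = [("r", "balance"), ("r", "log")]
```

Because the kernel looks only at the step sequence, it necessarily returns the same partition for `debit` and `credit`, which is the property required of an application independent rule.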

2.3.4  Optimality  and  Completeness 

We have defined four important properties of a scheduling rule, namely consistency, correctness,
modularity and application independence. In this section, we address issues related to the degree of
concurrency provided by scheduling rules. We address these issues using the concepts of optimality and
completeness. Before formally defining these two concepts, we first comment on the implications of using a
richer set of primitive steps in addition to the "read" and "write" steps used in this work.

Korth [Korth 83] has shown that for serializable schedules the concurrency can be improved if the set of
primitive steps is expanded to include other commutative ones. The idea is that if a set of steps is commutative,
then there is no need to control their relative order. The reason is as follows. We know that if in schedule
z the ordering of the steps on each of the data objects O in the database, $t_{i,j}(O) < \ldots < t_{k,m}(O)$, is consistent
with the ordering of steps of some serial schedule z*, then schedule z is serializable. Suppose that in schedule z
some of the steps are "out of order" but they are commutative. Without affecting the results of computations we can
always rearrange the ordering of these steps in such a way that the ordering of these steps becomes consistent
with that in a serial schedule. Therefore, schedule z is serializable. In the following, we limit our discussion
to transactions using only the primitive steps "read" and "write". The use of commutative steps to improve the
concurrency of (generalized) setwise serializable schedules can be done in a manner similar to that done by
Korth for serializable schedules.

We begin our investigation by first defining a way to compare the degree of concurrency offered by
different scheduling rules.




Definition 2.21: Scheduling rule R* is said to be at least as concurrent as R, denoted by $R^* \ge R$, if and
only if,

$$\forall (T \subseteq \mathcal{T})\ (Z_R(T) \subseteq Z_{R^*}(T))$$

That is, the concurrency of scheduling rules is partially ordered by set containment. The relative concurrency of
two scheduling rules can be incomparable. We now define the concept of an optimal application independent
scheduling rule.
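
The comparison by set containment, including the possibility of incomparable rules, can be illustrated on finite data. In the sketch below a rule is represented only by the sets of schedules it admits for each transaction system; this representation and the names `at_least_as_concurrent`, `rule_a`, `rule_b` and `rule_c` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Hashable

# For each transaction system (any hashable key), the set of schedules admitted.
Admitted = Dict[Hashable, FrozenSet[str]]


def at_least_as_concurrent(r_star: Admitted, r: Admitted) -> bool:
    """R* >= R: for every transaction system, Z_R(T) is contained in Z_R*(T)."""
    return all(r[system] <= r_star.get(system, frozenset()) for system in r)


# Hypothetical schedule sets for a single transaction system "T".
rule_a = {"T": frozenset({"z1", "z2"})}
rule_b = {"T": frozenset({"z1", "z2", "z3"})}
rule_c = {"T": frozenset({"z1", "z4"})}
```

Here `rule_b` is at least as concurrent as `rule_a`, whereas `rule_b` and `rule_c` are incomparable, since neither schedule set contains the other.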

Definition 2.22: Let $A_A$ be the set of all the application independent scheduling rules. An application
independent scheduling rule $R^* \in A_A$ is said to be optimal if R* is at least as concurrent as any rule in $A_A$.
That is,

$$\forall (R \in A_A)\ (R^* \ge R)$$

We now introduce the concept of completeness. A family of modular scheduling rules is said to form a
complete class within the set of modular scheduling rules if and only if given any modular scheduling rule R
we can always find a rule R* in this family of rules such that R* is at least as concurrent as R.

Definition 2.23: Let $A_M$ be the set of all the consistent and correct modular scheduling rules. A set of
consistent and correct modular scheduling rules $\mathcal{R}$ is said to form a complete class within $A_M$ if and only if

$$\forall (R \in A_M)\ [\exists (R^* \in \mathcal{R})\ (R^* \ge R)]$$

2.4  Summary 

In this chapter, we have formally defined our model of modular scheduling rules. In Sections 2.1 and 2.2,
we provided a formal description of the standard concepts in the study of transaction systems such as
database, transactions and schedules. In Section 2.3, we defined the concepts related to modular scheduling
rules and their important properties. Having formally defined these basic concepts, we now move on to the
investigation of the structure of the database that can be used for the purpose of concurrency control in the
light of the knowledge of database consistency constraints.




3.  Atomic  Data  Sets 

In Chapter 2, we developed a formal model of modular scheduling rules. This model provides us with the
theoretical basis for the development of new concurrency control rules. We now develop a major concept
called atomic data sets, which is important to the development of modular concurrency control and failure
recovery rules. Intuitively, an atomic data set is simply a set of data objects with a set of associated consistency
constraints that can be satisfied independently of other atomic data sets. For example, in a distributed
computer system a job queue at a computer waiting for execution is an atomic data set, the local directory to a
computer's files is an atomic data set, and each user's mailbox is also an atomic data set. This is because the
consistency of each of these entities is defined by a set of consistency constraints that can be satisfied
independently of others.

It is important to note that inter-related data objects can be in different atomic data sets, as long as the
consistency of an atomic data set can be satisfied independently of other atomic data sets. For example, we
can move a job waiting on the job queue of one computer to the job queue of another computer in order to
balance the workloads. That is, these job queues are related by some load balancing transactions. However, at
each computer a load balancing transaction must observe the consistency constraints of the local job queue
such as the maximum size and capability constraints. The capability constraints associated with a job queue
are a specification of the types of jobs that are allowed to execute on a computer.

Given a set of consistency constraints of the database (shared system data objects), we show in Theorem 3.2
of  this  chapter  that  there  always  exists  a  unique  maximal  consistency  preserving  partition  of  the  database. 
That  is,  there  is  always  a  unique  way  to  partition  the  database  into  the  largest  possible  number  of  atomic  data 
sets  when  the  consistency  constraints  are  given.  In  Chapter  4,  we  will  show  that  if  each  transaction  is 
consistent  and  correct  when  executed  alone,  then  each  transaction  will  also  be  executed  consistently  and 
correctly  when  it  is  executed  setwise  serializably.  That  is,  when  the  database  is  partitioned  into  atomic  data 
sets,  we  only  need  to  enforce  serializability  of  concurrency  control  on  a  set  by  set  basis.  To  improve  system 
concurrency,  we  would  like  to  have  many  small  atomic  data  sets  instead  of  a  few  large  ones  in  the  design  of 
the  database.  Theorem  3.2  of  this  chapter  indicates  that  the  number  of  atomic  data  sets  we  can  have  is 
determined by the consistency constraints. Therefore, there is a motivation to design "weaker" consistency
constraints in a distributed computer system in order to exploit the potential concurrency provided by
distributed hardware. In Section 7.2, we use an example to illustrate why distributed computer systems call




for an approach to specifying consistency constraints that is different from the approach to which we have
become accustomed in the design of centralized computer systems. Having introduced the relationship between
concurrency control and the concept of atomic data sets, we now formalize this concept in this chapter.

3.1  Consistency  Preserving  Partition  of  The  Database 

In this section, we first define the concepts of atomic data sets and a consistency preserving partition of the
database. Next, we show that the conjunction of the consistency constraints of the atomic data sets is
equivalent to that of the database. Finally, we prove that there is always a unique maximal consistency
preserving partition with respect to a given set of consistency constraints.

Definition 3.1-1: Let $I = \{1, 2, \ldots, n\}$ be the index set of database D. The index $i \in I$ specifies the data
object $O_i \in D$. Let $\pi_S(V)$ denote the projection of an n-tuple $V \in \Omega$ using the set of indices $S \subseteq I$. That is,
$\pi_S(V)$ denotes the tuple whose elements are the values of the data objects indexed by S. Let $P = \{S_1, \ldots, S_k\}$
denote a partition of I. Let $U_i$ be the set whose elements are the projections of all the consistent states, $X \in U$,
onto an arbitrary index set $S_i$, that is, $U_i = \bigcup_{X \in U} \{\pi_{S_i}(X)\}$. A partition of the index set I,
$P = \{S_1, S_2, \ldots, S_k\}$, is said to be consistency preserving (CP) if and only if,

$$\forall (Y \in \Omega)\ \{ [\pi_{S_i}(Y) \in U_i,\ i = 1 \text{ to } k] \rightarrow [Y \in U] \}.$$

Definition 3.1-2: An atomic data set $\mathcal{A}_i$ for a CP partition P is the set of data objects specified by $S_i \in P$.
The associated partition of data objects in D, Q, is called a consistency preserving partition of D.

The definition of a CP partition states that a CP partition has the property that any choice of the consistent
states  of  the  atomic  data  sets  leads  to  a  consistent  state  of  the  database.  We  now  introduce  the  concept  of 
consistency  constraints  of  an  atomic  data  set.  Next  we  show  that  the  consistency  constraints  of  the  database 
can  be  decomposed  into  sets  of  ADS  consistency  constraints. 
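
For a small finite database, the CP property can be tested by brute force: project every consistent state onto each index set, then check that every state whose projections are all consistent is itself consistent. The sketch below does this under our own assumptions (0-based indices rather than the 1-based indices of the text, integer-valued data objects, and a consistency predicate supplied as a function); the names `project`, `is_cp`, `domains` and `same_first_two` are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Set, Tuple

State = Tuple[int, ...]


def project(state: State, indices: Set[int]) -> State:
    """pi_S: keep only the components of the state indexed by S."""
    return tuple(state[i] for i in sorted(indices))


def is_cp(partition: List[Set[int]],
          domains: Sequence[Sequence[int]],
          consistent: Callable[[State], bool]) -> bool:
    """Brute-force test of the CP condition over every state in Omega."""
    omega = list(product(*domains))
    u = {x for x in omega if consistent(x)}
    u_parts = [{project(x, s) for x in u} for s in partition]
    return all(y in u
               for y in omega
               if all(project(y, s) in ui for s, ui in zip(partition, u_parts)))


# Hypothetical database of three binary objects with the constraint O_0 == O_1.
domains = [(0, 1)] * 3


def same_first_two(x: State) -> bool:
    return x[0] == x[1]
```

With this constraint the partition {{0, 1}, {2}} is CP, while {{0}, {1}, {2}} is not: the state (0, 1, 0) has consistent projections onto each singleton but violates the constraint.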

Definition 3.2: Let P be a CP partition of I and let $S_i \subseteq I$ be the set of indices which specify the data
objects in the atomic data set $\mathcal{A}_i \subseteq D$. Let the set of all the consistent states of an ADS $\mathcal{A}_i$ be
$U_i = \bigcup_{X \in U} \{\pi_{S_i}(X)\}$. The set of consistency constraints $C_i$ whose truth set is the consistent states of
$\mathcal{A}_i$ is called the ADS consistency constraints of $\mathcal{A}_i$. That is,



Proof: If S = I, then $H_S(X_1, X_2) = X_1$, and the result follows.

Let $P = \{S, \sigma_1, \ldots, \sigma_k\}$, $k \ge 1$, be a CP partition.

Define $W_0 = X_1$ and $W_i = X_2$, i = 1 to k, so that $W_i \in U$, i = 0 to k. Then

$$\pi_S(W_0) = \pi_S(H_S(X_1, X_2))$$

$$\pi_{\sigma_i}(W_i) = \pi_{\sigma_i}(H_S(X_1, X_2)), \quad i = 1 \text{ to } k.$$

Given that P is CP, $H_S(X_1, X_2)$ is therefore in U by the definition of a CP partition. Thus $H_S$ maps pairs of
consistent states into a consistent state. □

When we have two or more distinct CP partitions of the same index set I, sets from distinct partitions could
intersect. Lemma 3.2-2 generalizes Lemma 3.2-1 by allowing the intersections to be used for the specification
of swapping. For example, let $P_2 = \{\{1\}, \{2, 3\}\}$ be a second CP partition. The intersection of $\{1, 2\} \in P_1$
and $\{2, 3\} \in P_2$ is {2}. Lemma 3.2-2 states that the two states resulting from swapping the projections of A
and B specified by {2} are also consistent. That is, $(a_1, b_2, a_3)$ and $(b_1, a_2, b_3)$ are consistent. We show the
consistency of $(a_1, b_2, a_3)$ as follows. First, we use Lemma 3.2-1 to swap the projections of A and B specified
by {1} of $P_2$. One of the two resulting consistent states is $F = (a_1, b_2, b_3)$. Next, swapping the projections of
A and F specified by {3} of $P_1$, we find $(a_1, b_2, a_3)$ to be a consistent state also. We now give a general proof
of Lemma 3.2-2.

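Lemma 3.2-2: Suppose that $S \in P_1$ and $\sigma \in P_2$, where $P_1$ and $P_2$ are CP. Then $H_{S \cap \sigma}: U \times U \rightarrow U$.

Proof: If $S \in P_1$, then there exists a CP partition P such that $S^c \in P$, and likewise for $\sigma^c$. Since $X_1, X_2 \in U$,
it follows that $H_{S^c}(X_2, X_1) \in U$ by Lemma 3.2-1. Therefore, $H_{\sigma^c}(X_2, H_{S^c}(X_2, X_1)) \in U$ as well. The Lemma
follows, since $H_{\sigma^c}(X_2, H_{S^c}(X_2, X_1)) = H_{S \cap \sigma}(X_1, X_2)$. □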

Lemma 3.2-3 demonstrates that the least common refinement of any two CP partitions is also a CP
partition. For example, the least common refinement of $P_1$ and $P_2$, {{1}, {2}, {3}}, is also CP. In this case,
given a set of consistent states such as $(a_1, a_2, a_3)$, $(b_1, b_2, b_3)$ and $(c_1, c_2, c_3)$, we must prove that
$(a_1, b_2, c_3)$ is also consistent. First, we apply the intersection of {1} and {1, 2} to "A, B" and "A, C"
respectively. States ($a_1$,




$$U_i = \{ \pi_{S_i}(V) \mid C_i(\pi_{S_i}(V)) = 1 \}$$

Theorem 3.1: The conjunction of all the ADS constraints $C_i$, i = 1 to k, is equivalent to the consistency
constraints C of D. That is,

$$C = C_1 \land C_2 \land \ldots \land C_k.$$

Proof: Let U' be the truth set of the conjunction of all the ADS constraints. We have,

$$U' = \{ V \mid C_i(\pi_{S_i}(V)),\ i = 1 \text{ to } k \} = \{ V \mid \pi_{S_i}(V) \in U_i,\ i = 1 \text{ to } k \} = U.$$

Hence, $C = C_1 \land C_2 \land \ldots \land C_k$. □

We now prove the theorem that there always exists a unique maximal CP partition. First, we note that CP
partitions exist, since the trivial partition {I} is CP. Furthermore, the CP partitions are partially ordered by
refinement. That is, for any pair of CP partitions $P_1$ and $P_2$, partition $P_1$ is refined by $P_2$ if and only if
$\forall (S_2 \in P_2)\ \exists (S_1 \in P_1)\ (S_2 \subseteq S_1)$. A maximal CP partition is one which is refined by no other CP partition.

In the following, we will prove that there exists a unique maximal CP partition P. Since the proof is relatively
complex, we would like first to illustrate the ideas of the proof.

The proof is based on three lemmas. The idea of Lemma 3.2-1 is illustrated by the following example. Let
$P_1 = \{\{1, 2\}, \{3\}\}$ be a CP partition of the index set I = {1, 2, 3}. Suppose that $A = (a_1, a_2, a_3)$ and
$B = (b_1, b_2, b_3)$ are two consistent states. Let S be a partition set, either {1, 2} or {3}. Lemma 3.2-1 states that
the two new states which result from swapping the projections of A and B specified by S are also consistent. That
is, $(a_1, a_2, b_3)$ and $(b_1, b_2, a_3)$ are consistent.

We now define a mapping $H_S: \Omega \times \Omega \rightarrow \Omega$ as follows, where $S \subseteq I$. Given that $X_1, X_2 \in \Omega$,
$H_S(X_1, X_2) = Y$, where Y satisfies $\pi_S(Y) = \pi_S(X_1)$ and $\pi_{S^c}(Y) = \pi_{S^c}(X_2)$, where $S^c = I - S$. Thus
$H_S(X_1, X_2)$ replaces the projections of $X_2$ specified by S with the projections of $X_1$ specified by S.

Lemma 3.2-1: Suppose that $X_1, X_2 \in U$. If S is an element of any CP partition of I, then $H_S: U \times U \rightarrow U$.
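
The swapping map can be written in a few lines. The sketch below is an illustration only: states are tuples of strings, indices are 0-based (so the index set {1, 2} of the text becomes {0, 1}), and the function name `swap` is our own.

```python
from typing import Set, Tuple

State = Tuple[str, ...]


def swap(s: Set[int], x1: State, x2: State) -> State:
    """H_S(X1, X2): components indexed by S come from X1, the rest from X2."""
    return tuple(x1[i] if i in s else x2[i] for i in range(len(x1)))


a = ("a1", "a2", "a3")
b = ("b1", "b2", "b3")

# Swapping on {1, 2} (indices 0 and 1 here) gives the two states of the example.
print(swap({0, 1}, a, b))   # ('a1', 'a2', 'b3')
print(swap({0, 1}, b, a))   # ('b1', 'b2', 'a3')
```

The map only constructs the swapped state. That the result is consistent whenever both arguments are consistent is what the lemma asserts for a set S taken from a CP partition.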




software modules: modules that encapsulate distributed data objects. For example, one can generalize the
monitor (or abstract data type) approach developed for centralized systems as follows. We can define a set of
primitive procedures for each of the data objects in an atomic data set to facilitate the manipulation of these
data objects. When the scheduling rules grant one the privilege, one is entitled to use those pre-defined
procedures as building blocks of one's own transaction. In the next chapter, we show how the concept of
atomic data sets can be used for concurrency control.



$b_2$, $b_3$) and $(a_1, c_2, c_3)$ are two of the four new consistent states. Next, we apply the intersection of {2, 3} and
{3} to these two new states. One of the two resulting consistent states is $(a_1, b_2, c_3)$. We now give a general
proof of Lemma 3.2-3.


Lemma 3.2-3: If $P_1$ and $P_2$ are CP, then their least common refinement is also CP.


Proof: Let $P_1 = \{S_1, \ldots, S_m\}$ and $P_2 = \{\sigma_1, \ldots, \sigma_n\}$. Their least common refinement is
$P_1 \cap P_2 = \{C_1, \ldots, C_L\}$, where $C_i = S_j \cap \sigma_k$ for some j, k, i = 1 to L.

Let $X_i \in U$, i = 1 to L, and $Y \in \Omega$ be given such that $\pi_{C_i}(X_i) = \pi_{C_i}(Y)$, i = 1 to L. We must prove $Y \in U$
to conclude that $P_1 \cap P_2$ is CP.

To this end, we define a sequence $\{X'_i,\ i = 1 \text{ to } L\}$ as follows: $X'_1 = X_1$, $X'_j = H_{C_j}(X_j, X'_{j-1})$, j = 2 to L.
Noting that $C_j = S \cap \sigma$ for some $S \in P_1$ and $\sigma \in P_2$, Lemma 3.2-2 indicates that $X'_j \in U$, j = 1 to L. It
follows that $X'_L = Y \in U$. □


Theorem 3.2: There exists a unique maximal CP partition.


Proof: Suppose that there exists more than one maximal CP partition. The least common refinement of
distinct maximal CP partitions is CP by Lemma 3.2-3 and properly refines at least one of them, thus
contradicting the maximality assumption. □
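
The least common refinement used in this argument is simply the collection of non-empty pairwise intersections. A minimal sketch follows, assuming partitions are given as lists of frozensets; the name `least_common_refinement` is ours, and the function does not itself verify the CP property.

```python
from typing import FrozenSet, List

Partition = List[FrozenSet[int]]


def least_common_refinement(p1: Partition, p2: Partition) -> Partition:
    """All non-empty intersections of a set from p1 with a set from p2."""
    return [s & t for s in p1 for t in p2 if s & t]


# The two partitions of I = {1, 2, 3} used in the examples of this section.
p1 = [frozenset({1, 2}), frozenset({3})]
p2 = [frozenset({1}), frozenset({2, 3})]
refined = least_common_refinement(p1, p2)
```

For these two partitions, `refined` consists of the three singletons {1}, {2} and {3}, matching the example given for Lemma 3.2-3.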


Corollary 3.2: There exists a unique maximum CP partition Q of D.


In the next chapter, we show how consistency preserving partitions can be used to schedule transactions in
a non-serializable fashion. Although any CP partition can be used, Theorem 3.2 indicates that there is a CP
partition that is "most refined" with respect to a given set of consistency constraints. This partition will allow
the maximal concurrency in concurrency control.


3.2  Summary 

In  this  chapter,  we  have  defined  the  concepts  of  atomic  data  sets  and  of  a  consistency  preserving  partition  of 
the database. We have also shown that there exists a unique maximal consistency preserving partition of the
database  with  respect  to  the  given  set  of  consistency  constraints. 


From  an  application  point  of  view,  atomic  data  sets  can  also  be  used  as  a  basis  for  constructing  distributed 




4.  The  Setwise  Serializable  Scheduling  Rule 

An objective of this work is to deliver provably consistent, correct, modular and optimal scheduling rules.
Having developed a model of modular concurrency control rules and the concept of atomic data sets, we are
now in a position to introduce such a rule.

Conceptually, the major result of this chapter can be stated as follows. When the database is partitioned
into consistency preserving atomic data sets, we only need to enforce the serializability of concurrency control
on a set by set basis rather than globally. As long as each transaction is consistent and correct when executing
alone, it will be executed consistently and correctly provided that it is executed setwise serializably.
Furthermore, this is also the best we can do in the absence of transaction semantic information.

This conceptual understanding is formally expressed as the optimal application independent scheduling
rule: the setwise serializable scheduling rule. Given a CP partitioned database and a transaction, the setwise
serializable scheduling rule partitions the steps of the given transaction into atomic step segments in the form
of transaction ADS segments. A transaction ADS segment in a transaction is simply all the steps in this
transaction accessing the same atomic data set. We prove that the setwise serializable scheduling rule is
consistent, correct and modular. In addition, we prove that this rule is the optimal application independent
scheduling rule. In other words, this is the rule that provides the highest degree of concurrency and does not
use any transaction semantic information. The term "application independent" refers to the fact that this rule
can be applied to transactions by an algorithm which does not know the semantics of transactions.

This  chapter  is  organized  as  follows.  In  Section  4.1.  we  define  the  concepts  of  setwise  serializable  schedules 
and  the  setwise  serializable  scheduling  rule.  In  Section  4.2,  we  prove  that  this  rule  is  consistent  and  correct  in 
the context of single level transactions. In Section 4.3, we generalize these results to nested transactions. In
Section 4.4, we prove that this rule is application independent and optimal. Finally, we summarize this
chapter in Section 4.5.

4.1  Definitions 

In this section, we first define the concept of a transaction ADS segment. We then define the scheduling
rule called the setwise serializable scheduling rule. Finally, we define the set of schedules that satisfy this rule.




Definition 4.1: A transaction ADS segment is the sequence of steps in a transaction that read or write data
objects in the same ADS. Let $\Psi(i, \mathcal{A})$ denote the transaction ADS segment of transaction $T_i$ accessing ADS
$\mathcal{A}$. Let $t_{i,j} > t_{k,m}$ denote that step $t_{i,j}$ is executed after step $t_{k,m}$. We have

1. $\Psi(i, \mathcal{A}) = \{ t \mid (t \in T_i) \land (t \text{ reads or writes a data object in ADS } \mathcal{A}) \}$


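2. $\forall (t_{i,j}, t_{i,k} \in \Psi(i, \mathcal{A}))\ \big( (t_{i,j} > t_{i,k} \text{ in } T_i) \rightarrow (t_{i,j} > t_{i,k} \text{ in } \Psi(i, \mathcal{A})) \big)$

That is, the steps of $\Psi(i, \mathcal{A})$ retain the order they have in $T_i$.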


Definition 4.2: The scheduling rule that partitions each of the transactions in a transaction system T into
transaction ADS segments is called the setwise serializable scheduling rule.
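
The partition performed by this rule needs only the read and write steps of a transaction and a map from data objects to atomic data sets. The following sketch illustrates this under our own assumptions: the step encoding, the dictionary `ads_of`, and the names `ads_segments` and `transfer` are hypothetical.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (operation, data object); operation is "r" or "w"


def ads_segments(transaction: List[Step],
                 ads_of: Dict[str, str]) -> Dict[str, List[int]]:
    """Group step positions by the atomic data set they access, preserving the
    order the steps have in the transaction."""
    segments: Dict[str, List[int]] = {}
    for k, (_op, obj) in enumerate(transaction):
        segments.setdefault(ads_of[obj], []).append(k)
    return segments


# Hypothetical database: two job-queue objects in one ADS, a mailbox in another.
ads_of = {"q1": "queue", "q2": "queue", "m1": "mailbox"}
transfer = [("r", "q1"), ("w", "q1"), ("r", "m1"), ("w", "q2"), ("w", "m1")]
```

For `transfer` the sketch yields the segment [0, 1, 3] for the queue ADS and [2, 4] for the mailbox ADS. The function never inspects what the steps compute, in keeping with application independence.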


A setwise serial schedule z for a transaction system T is a schedule in which transaction ADS segments
accessing the same ADS do not overlap. This concept is formalized as follows.


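Definition 4.3: Let $t_i^{\mathcal{A},f}$ and $t_i^{\mathcal{A},l}$ denote the first step and the last step of transaction ADS segment
$\Psi(i, \mathcal{A})$ respectively. Let Q be a consistency preserving partition of D. A schedule z for transaction system T
is said to be setwise serial if and only if under z,

$$\forall (\mathcal{A} \in Q)\ \forall (T_i \in T)\ \forall (t^{\mathcal{A}} \in z \land t^{\mathcal{A}} \notin T_i)\ \big( (t_i^{\mathcal{A},f} > t^{\mathcal{A}}) \lor (t^{\mathcal{A}} > t_i^{\mathcal{A},l}) \big)$$

where $t^{\mathcal{A}}$ represents any step accessing ADS $\mathcal{A}$ in the transaction system T.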


Having defined the concept of setwise serial schedules, we now define a setwise serializable schedule as one
which is equivalent to a setwise serial schedule.


Definition 4.4: A schedule z for transaction system T is said to be setwise serializable if there exists a setwise
serial schedule z* for T such that $z \equiv z^*$.
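
The setwise serial condition can be tested directly on a schedule: within each atomic data set, no step of one transaction may fall between the first and last step of another transaction's ADS segment. The sketch below checks only this serial condition (not equivalence to a setwise serial schedule) and rests on our own assumptions about representation; the names `is_setwise_serial`, `ads_of`, `crossed` and `overlap` are hypothetical.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]  # (transaction, operation, data object)


def is_setwise_serial(schedule: List[Step], ads_of: Dict[str, str]) -> bool:
    """True if, within every atomic data set, no step of one transaction lies
    between the first and last step of another transaction's ADS segment."""
    by_ads: Dict[str, List[Tuple[int, str]]] = {}
    for pos, (txn, _op, obj) in enumerate(schedule):
        by_ads.setdefault(ads_of[obj], []).append((pos, txn))
    for steps in by_ads.values():
        span: Dict[str, Tuple[int, int]] = {}
        for pos, txn in steps:
            first = span[txn][0] if txn in span else pos
            span[txn] = (first, pos)
        for pos, txn in steps:
            if any(other != txn and first < pos < last
                   for other, (first, last) in span.items()):
                return False
    return True


ads_of = {"x": "A", "y": "B"}   # hypothetical: objects x and y lie in different ADSs
crossed = [("T1", "w", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "w", "y")]
overlap = [("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]
```

The schedule `crossed` passes the check even though T1 precedes T2 on ADS A and follows it on ADS B, which shows how the set by set condition is weaker than a global serial order. The schedule `overlap` fails because T2 writes x between two steps of T1 on the same ADS.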

4.2  Consistency  and  Correctness 

We now prove that setwise serializable schedules are consistent and correct. The proof is organized into
three lemmas. Let $T_i$ be a consistent and correct transaction. In Lemma 4.1-1, we prove that $T_i$ preserves the
consistency of each of the accessed atomic data sets and produces correct results when executing alone. In
Lemma 4.1-2, we further prove that at the end of executing a transaction ADS segment $\Psi(i, \mathcal{A})$ of $T_i$, the
consistency of $\mathcal{A}$ has been already preserved. In addition, the output values of data objects in $\mathcal{A}$ are correct at




the end of $\Psi(i, \mathcal{A})$. We need not wait for the end of $T_i$ to know these results. In Lemma 4.1-3, we relax the
executing alone condition. We show that the results of Lemma 4.1-2 are still valid for any ADS $\mathcal{A}$, as long as
$\mathcal{A}$ is consistent at the beginning of transaction segment $\Psi(i, \mathcal{A})$.


Definition 4.5: An ADS $\mathcal{A}_j$ is said to be accessed by a transaction if this transaction reads or writes one or
more data objects in $\mathcal{A}_j$.


Lemma 4.1-1: Let $Q = \{\mathcal{A}_1, \ldots, \mathcal{A}_k\}$ be a given CP partition of D. Let $T_i$ be a consistent and correct
transaction. If $T_i$ executes alone and if the states of the atomic data sets accessed by $T_i$ are initially consistent,
then at the end of $T_i$ the state of each of the accessed atomic data sets is consistent, and the values output by
$T_i$ satisfy the post-condition of $T_i$.


Proof: Let $\mathcal{A}_{i,j}$, j = 1 to $k_i$, be the atomic data sets accessed by transaction $T_i$. Let $Y \in \Omega$ be a state of D such
that $C_{i,j}(\pi_{S_{i,j}}(Y)) = 1$, j = 1 to $k_i$, where $C_{i,j}$ represents the ADS consistency constraints of $\mathcal{A}_{i,j}$, and $S_{i,j}$
represents the index set of $\mathcal{A}_{i,j}$. Now let X be a consistent state of the database such that
$\pi_{S_{i,j}}(X) = \pi_{S_{i,j}}(Y)$, j = 1 to $k_i$. Next, we let $T_i$ execute alone with the database initially in state X. We now
prove that with either X or Y as initial state, the executions of $T_i$ are equivalent.


By assumptions A1 to A3 in Section 2.2.1, with X as initial state, $T_i$ produces correct results and preserves
the consistency of the database. It follows that $T_i$ preserves the consistency of all the atomic data sets. To
complete the proof, we must show that the values of the local variables and the values of the global variables
of $T_i$ are identical in both schedules.


Let the values of the local variable and the data object in the execution with initial state X be $L^x_{i,j}$ and
$O^x_{i,j}$. Let the values of the local variable and the data object with initial state Y be $L^y_{i,j}$ and $O^y_{i,j}$. We must
prove that $L^x_{i,j} = L^y_{i,j}$ and $O^x_{i,j} = O^y_{i,j}$ at each step of the transaction. Recall that the syntax of a transaction
step is as follows.


it  ul 


Since the initial states of all accessed ADS are equal with either X or Y as the initial state, it follows that the




initial values of the accessed data objects are equal. Hence, at the first step of $T_i$, $L^x_{i,1} = L^y_{i,1}$. In addition,
$O^x_{i,1} = f_{i,1}(L^x_{i,1}) = f_{i,1}(L^y_{i,1}) = O^y_{i,1}$. Next, $L^x_{i,2} = L^y_{i,2}$, because the second step either reads the initial
value of a data object or the value of the data object output by step 1. Similarly, $O^x_{i,2} = O^y_{i,2}$.

Now suppose that these local variable and data object value pairs are equal from steps 1 to r. That is,
$L^x_{i,h} = L^y_{i,h}$ and $O^x_{i,h} = O^y_{i,h}$, h = 1 to r. We show that $L^x_{i,r+1} = L^y_{i,r+1}$. This follows because step r+1 either
reads the initial value of a data object or a data object which has been output by some step between 1 to r. It
follows that $O^x_{i,r+1} = O^y_{i,r+1}$. By induction, the final values of accessed data objects with either X or Y as
initial state are equal. Since the values of data objects in $\mathcal{A}_{i,j}$, j = 1 to $k_i$, not accessed by $T_i$ remain
unchanged, they must be equal at the end of the transaction with either X or Y as the initial state. That is,
$\pi_{S_{i,j}}(W_x) = \pi_{S_{i,j}}(W_y)$, j = 1 to $k_i$, where $W_x$ and $W_y$ are the states of D at the end of executing $T_i$ with X and
Y as initial states respectively. Since the execution using X as the initial state is assumed to preserve the
consistency of each of the accessed atomic data sets, the execution using Y as the initial state must also
preserve the consistency of each of the accessed atomic data sets. Since the execution with X as the initial state
produces correct results, the execution using Y as the initial state must also produce correct results. □


There are two implications from Lemma 4.1-1. First, after the decomposition of the database, a programmer
is only required to know and maintain the consistency constraints of the atomic data sets accessed by his
transaction. Without a CP partition, everyone must, in principle, know all the database consistency constraints
in order to verify that one's transaction will not violate any of them. Second, this lemma implies that a
transaction can still function properly as long as the accessed atomic data sets are consistent, even if the rest of
the atomic data sets are inconsistent. This is useful in fault isolation, although this topic is outside the scope
of this work. We now turn to the second lemma.
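
The second implication can be pictured with a small Python sketch. It is only an illustration under stated
assumptions: the database layout, the ADS names `accounts` and `inventory`, and the `transfer` transaction
are all hypothetical and do not come from the report. Two initial states agree on the accessed ADS and
differ elsewhere, and the transaction behaves identically on both.

```python
# Hypothetical database: data objects grouped into two atomic data sets (ADSs).
ADS = {"accounts": ["a", "b"], "inventory": ["stock"]}


def transfer(db):
    """Hypothetical transaction touching only the 'accounts' ADS (a + b stays constant)."""
    db = dict(db)              # work on a copy so the caller's state is untouched
    amount = min(10, db["a"])  # local variable computed from an accessed data object
    db["a"] -= amount
    db["b"] += amount
    return db


def project(db, ads_name):
    """Projection of a database state onto one ADS."""
    return {name: db[name] for name in ADS[ads_name]}


# X and Y agree on the accessed ADS; Y is deliberately broken in the other ADS.
X = {"a": 30, "b": 70, "stock": 5}
Y = {"a": 30, "b": 70, "stock": -999}

after_x, after_y = transfer(X), transfer(Y)
same_accessed = project(after_x, "accounts") == project(after_y, "accounts")
untouched = project(after_y, "inventory") == project(Y, "inventory")
```

Here `same_accessed` records that the accessed ADS ends in the same state from either initial state, and
`untouched` records that the ADS not accessed keeps whatever state it had, consistent or not.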


Lemma 4.1-2: Let the atomic data sets accessed by transaction T_i be A_{i,j}, j = 1 to k_i. If A_{i,j}, j = 1 to k_i, are
initially consistent and if T_i executes alone, then at the end of transaction ADS segment Ψ(i, A_{i,j}) the
consistency of A_{i,j} is preserved. Furthermore, the values of data objects in A_{i,j} output by Ψ(i, A_{i,j}) are correct
at the end of Ψ(i, A_{i,j}).


Proof: At the end of the transaction ADS segment Ψ(i, A_{i,j}), j = 1 to k_i, the data objects in A_{i,j}, j = 1 to k_i,
are neither read nor written again. It follows that the values of the data objects in A_{i,j}, j = 1 to k_i, are the same
as at the end of the transaction. By Lemma 4.1-1, at the end of the transaction, the consistency of each of the
atomic data sets is preserved, and the values of the data objects output by T_i are correct. It follows that at the
end of each of the transaction ADS segments the state of the accessed atomic data set is consistent and the
values of the data objects output by the segment are correct. □

Lemma 4.1-3: In a setwise serial schedule, if at the beginning of a transaction ADS segment Ψ(i, A_{i,j}), j = 1
to k_i, ADS A_{i,j} is initially consistent, then A_{i,j} is consistent at the end of Ψ(i, A_{i,j}), and the values of each of
the data objects in A_{i,j} output by T_i are correct.

Proof: Let the atomic data sets accessed by T_i be A_{i,j}, j = 1 to k_i. Now let T_i execute alone in a serial
schedule z with the initial states of A_{i,j}, j = 1 to k_i, being identical to the initial states of A_{i,j}, j = 1 to k_i, in
the setwise serial schedule z*.

Let the values of the local variables and data objects in the serial schedule z be L_{i,j} and O_{i,j}, and those
in setwise serial schedule z* be L*_{i,j} and O*_{i,j}. We now prove that the executions of T_i under z and z* are
equivalent. Recall that the syntax of a transaction step is given by

    t_{i,j}:  L_{i,j} := O_{i,j};  O_{i,j} := f_{i,j}(L_{i,1}, ..., L_{i,j})

Since the initial states of ADS A_{i,j}, j = 1 to k_i, are equal in both schedules, the initial values of all the data
objects in A_{i,j}, j = 1 to k_i, are equal. Therefore, the first steps in both schedules input the same value. That
is, L_{i,1} = L*_{i,1}. In addition, O_{i,1} = f_{i,1}(L_{i,1}) = f_{i,1}(L*_{i,1}) = O*_{i,1}. Next, L_{i,2} = L*_{i,2}, because step two
either reads the initial value of a data object or the value of the data object output by step 1. Similarly,
O_{i,2} = O*_{i,2}.

Now suppose that these local variable and data object value pairs are equal from steps 1 to r. That is,
L_{i,h} = L*_{i,h} and O_{i,h} = O*_{i,h}, h = 1 to r. We show that L_{i,r+1} = L*_{i,r+1}. This follows because step r+1 either
reads the initial value of a data object or a data object which has been output by some step between 1 and r. It
follows that O_{i,r+1} = O*_{i,r+1}. Therefore, the final values of accessed data objects in both schedules are
equal at the end of each transaction ADS segment. In addition, data objects in A_{i,j}, j = 1 to k_i, not accessed
by T_i remain unchanged and are therefore equal at the end of each transaction ADS segment for both schedules.
It follows from Lemma 4.1-2 that at the end of Ψ(i, A_{i,j}), the A_{i,j}, j = 1 to k_i, are consistent, and the value of
each of the data objects in A_{i,j}, j = 1 to k_i, output by T_i is correct. □




Theorem 4.1: A setwise serial schedule is consistent and correct.

Proof: Let Q = {A_1, ..., A_k} be a CP partition of D. Let the initial states of each of the ADSs be Z_j[0],
j = 1 to k. These initial states are assumed to be consistent.

Since a schedule is a totally ordered set of steps from all the transactions, each of which terminates, there
must exist a transaction ADS segment Ψ(i, A_j) which first finishes its computation. Let the associated ADS
state be Z_j[1]. Since there are no interleavings among transaction ADS segments accessing the same ADS in
a setwise serial schedule, Z_j[1] must be output by a transaction which has used only the initial states that
were assumed to be consistent. By Lemma 4.1-3, Z_j[1] is consistent, and the values of data objects in A_j
output by Ψ(i, A_j) are correct. Consider now the output of the second transaction ADS segment produced by
the schedule. Since it can use only Z_j[1] or Z_m[0], m = 1 to k and m ≠ j, at the end of this second
transaction ADS segment, the accessed atomic data set is in a consistent state and the output values are correct
by Lemma 4.1-3. Now assume that the first n transaction ADS segments produce consistent and correct
results. The (n+1)th must also do so, by the same argument. By induction, the ADS state produced by each of the
transaction ADS segments is consistent, and the values of the data objects output in each ADS at the end of
the transaction ADS segment satisfy the post-condition. It follows that a setwise serial schedule is consistent
and correct. □

Corollary 4.1-1: Setwise serializable schedules are consistent and correct.

Proof:  It  directly  follows  from  Definition  4.4  and  Theorem  2.1.  □ 

Corollary  4.1-2:  The  setwise  serializable  scheduling  rule  is  consistent  and  correct. 

Proof:  It  directly  follows  from  Corollary  4.1-1.  □ 

Throughout this section, the choice of atomic data sets has been arbitrary, because Theorem 4.1 applies to
any CP partition whether maximal or not. If the CP partition consists of a single ADS, then setwise serializable
schedules reduce to serializable schedules.
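
As a rough illustration of the setwise serial property used in the proof of Theorem 4.1 (no interleaving
among transaction ADS segments that access the same ADS), the following Python sketch checks a schedule
given as a list of (transaction, data object) steps. The representation, the function name
`is_setwise_serial`, and the sample data are assumptions made for this sketch, not notation from the report.

```python
def is_setwise_serial(schedule, ads_of):
    """Return True if, within every ADS, each transaction's steps are contiguous.

    schedule: list of (txn, obj) steps in execution order.
    ads_of:   dict mapping each data object to the name of its ADS.
    """
    finished = {}   # ADS -> transactions whose segment on that ADS has ended
    current = {}    # ADS -> transaction whose segment is currently in progress
    for txn, obj in schedule:
        ads = ads_of[obj]
        if current.get(ads) == txn:
            continue                      # same segment continues
        if txn in finished.setdefault(ads, set()):
            return False                  # txn resumes after another segment intervened
        if ads in current:
            finished[ads].add(current[ads])
        current[ads] = txn
    return True


# Hypothetical partition: objects A and B form ADS S1, object C forms ADS S2.
ads_of = {"A": "S1", "B": "S1", "C": "S2"}

# T1 precedes T2 on S1 while T2 precedes T1 on S2: setwise serial, yet not serial.
setwise = [("T1", "A"), ("T2", "C"), ("T1", "B"), ("T2", "A"), ("T1", "C")]

# T2 touches S1 between two steps of T1's segment on S1: not setwise serial.
interleaved = [("T1", "A"), ("T2", "A"), ("T1", "B")]

setwise_ok = is_setwise_serial(setwise, ads_of)
interleaved_ok = is_setwise_serial(interleaved, ads_of)
```

With a single ADS containing every object, the same check degenerates to requiring that whole
transactions do not interleave, which mirrors the remark that setwise serializable schedules reduce to
serializable schedules in that case.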




4.3 Nested Transactions

In the early work on serializability theory, a transaction was modelled as a sequence of steps. However, it is
natural to write transactions in a nested form, in which sub-transactions can be executed in parallel and
invoked by higher level ones. Recently, serializable nested transactions have been studied by Moss, Lynch
and Beeri et al. [Moss 81, Lynch 83a, Beeri 83]. We have proved the consistency and correctness of setwise
serializable schedules in the context of single level transactions. In this section, we now generalize this result
to nested transactions.


4.3.1  Syntax 

We can visualize a nested transaction as being organized in the form of a tree whose nodes are
sub-transactions and whose leaves are steps. The execution of the transaction is defined by the partial order of the
tree. The syntax of a nested transaction is formally defined as follows.

Definition 4.6: A nested transaction is a partially ordered set of steps with the following syntax:

<NestedTransaction> ::= BeginTransaction <NestedTransactionBody> EndTransaction

<NestedTransactionBody> ::= BeginSerial <SubTransactionList> EndSerial |
                            BeginParallel <SubTransactionList> EndParallel

<SubTransactionList> ::= <SubTransaction> | <SubTransactionList> <SubTransactionList>

<SubTransaction> ::= Step | <SubTransaction> ; <SubTransaction> | <NestedTransactionBody>

A step is modelled as an indivisible execution of the following two instructions.

    t_{i,j,k}:  L_{i,j,k} := O_{i,j,k}
                O_{i,j,k} := f_{i,j,k}({L_{i,j,k}} ∪ {L_{i,m,n} | t_{i,m,n} < t_{i,j,k}})

where t_{i,j,k} is the k-th step at level j of transaction T_i, L_{i,j,k} is the local variable used by step t_{i,j,k}, O_{i,j,k} is the
data object accessed by step t_{i,j,k}, and f_{i,j,k} represents the computation performed by step t_{i,j,k}. Note that step
t_{i,j,k} can use not only its own local variable but also the local variables associated with steps preceding it. In
this model, a "read" step is interpreted as one which writes the original value back into the data object. That
is, f_{i,j,k} is the identity function.

4.3.2  Consistency  and  Correctness 

In this section, we address the consistency and correctness of the setwise serializable scheduling rule when it
applies to nested transactions. In a nested transaction, the ordering between some steps is unspecified due to the
partial ordering. This implies that the results produced by any total ordering that is consistent with the partial
ordering in the transaction must be equally valid. Otherwise, one should specify the order. Formally, the
issues caused by partial ordering are addressed via the concept of the linearization of a nested transaction.

Definition 4.7: A single level transaction T^L_i is the linearization of the nested transaction T^N_i if and only if
T^L_i has the same steps as T^N_i and the total ordering of steps in T^L_i is consistent with the partial ordering of
steps in T^N_i. That is,

    [∀t: (t ∈ T^N_i) ⇔ (t ∈ T^L_i)] ∧ [∀(t_1, t_2): ((t_1, t_2 ∈ T^N_i) ∧ (t_1 > t_2)) ⇒ ((t_1, t_2 ∈ T^L_i) ∧ (t_1 > t_2))]
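
To make the notion of linearization concrete, the Python sketch below enumerates every total order of a set
of steps that is consistent with a given partial order. The function name `linearizations`, the step names,
and the example transaction are hypothetical; the partial order is supplied as explicit (earlier, later)
pairs rather than being derived from the BeginSerial/BeginParallel syntax.

```python
def linearizations(steps, precedes):
    """Yield every total order of `steps` consistent with the partial order.

    precedes: set of (earlier, later) pairs describing the partial order.
    """
    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for step in sorted(remaining):
            # A step is ready once all of its predecessors are already placed.
            if all(a in prefix for a, b in precedes if b == step):
                yield from extend(prefix + [step], remaining - {step})

    yield from extend([], frozenset(steps))


# Hypothetical nested transaction: t1, then t2 and t3 in parallel, then t4.
steps = ["t1", "t2", "t3", "t4"]
precedes = {("t1", "t2"), ("t1", "t3"), ("t2", "t4"), ("t3", "t4")}

all_orders = list(linearizations(steps, precedes))
```

For this example the parallel block contributes the only freedom, so the enumeration contains the two
orders that differ in whether t2 or t3 runs first.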

Definition 4.8: A nested transaction is said to be consistent and correct if and only if each of the linearizations
of the nested transaction, when executed alone, satisfies our three assumptions about a transaction: it
terminates (A1), preserves the consistency of the database (A2), and produces correct results (A3). In the
following, we limit our investigation only to consistent and correct nested transactions.

We  now  define  the  concept  of  a  nested  transaction  system  and  its  schedules. 

Definition 4.9: A nested transaction system T^N = {T^N_1, ..., T^N_m} is a finite set of nested transactions
operating upon the shared database D.

Definition 4.10: Let 𝒯^L_i be the set of all the linearizations of nested transaction T^N_i ∈ T^N. Let T^L be a
linearized transaction system for T^N. That is, T^L = {T^L_1, ..., T^L_m} is a transaction system in which
T^L_i ∈ 𝒯^L_i, i = 1 to m. Let 𝒯^L be the set of all the linearized transaction systems for T^N. A schedule z for a
nested transaction system T^N is a schedule of a linearized transaction system T^L ∈ 𝒯^L.

Theorem 4.2: Setwise serial schedules of a nested transaction system are consistent and correct.

Proof: By definition, when each of the linearizations of a nested transaction executes alone, it terminates,
preserves the consistency of the database and produces correct results. It follows from Theorem 4.1 that a
setwise serial schedule for a linearized nested transaction system is consistent and correct. This is true for all
the linearizations of the given nested transaction system. It follows that setwise serial schedules for nested
transaction systems are consistent and correct. □

Corollary 4.2-1: Setwise serializable schedules for nested transactions are consistent and correct.

Corollary 4.2-2: The setwise serializable scheduling rule is consistent and correct with respect to nested
transactions.


4.4  Application  Independence  and  Optimality 

In the previous section, we proved the consistency and correctness of the setwise serializable scheduling
rule. We now prove that this rule is also the optimal application independent scheduling rule.

Theorem 4.3: The setwise serializable scheduling rule is modular.

Proof:  It  directly  follows  from  Definitions  2.18  and  4.2.  □ 

Theorem 4.4: The setwise serializable scheduling rule is application independent.

Proof: The kernel of the setwise serializable scheduling rule is the function that partitions the steps of any
given transaction into transaction ADS segments. Transactions which are syntactically identical are partitioned
identically. It follows from Definition 2.20 that the setwise serializable scheduling rule is application
independent. □

We now prove that the setwise serializable scheduling rule is optimal in the set of application independent
scheduling rules. To illustrate the basic idea of the proof, let us consider an example first. Suppose that we
have an atomic data set with two data objects A and B. The consistent states of this ADS are (0, 1) and (1,
0). Suppose that transactions T_1 and T_2 assign (0, 1) and (1, 0) to this ADS respectively. Let transaction T_3 be a
transaction which has two steps t_{3,1} and t_{3,2} reading the values of A and B respectively. Thus, these two steps
are a transaction ADS segment. Suppose that an application independent scheduling rule R partitions this
transaction ADS segment into two atomic step segments t_{3,1} and t_{3,2} respectively. In the following, we show
that such a partition causes transaction T_3 to see an inconsistent state of the ADS.

We construct a schedule z = <T_1, t_{3,1}, T_2, t_{3,2}>. Note that this schedule satisfies R, since t_{3,1} and t_{3,2} are two
different atomic step segments under R. Since T_1 assigns (0, 1) to A and B, step t_{3,1} will see that the value of
A is 0. Since T_2 assigns (1, 0) to A and B, step t_{3,2} will see that the value of B is also 0. Thus, under z,
transaction T_3 sees an inconsistent state (0, 0). Since rule R is application independent, we can interpret the
semantics of T_3 as follows. If transaction T_3 sees an inconsistent state, then step t_{3,2} outputs an incorrect value.
Thus, rule R is incorrect. Since rule R is assumed to be consistent and correct, it cannot further partition
transaction ADS segments. It follows that the setwise serializable scheduling rule is at least as concurrent as
R. In summary, the key to the proof is to show that if an application independent scheduling rule R partitions
a transaction ADS segment σ into two or more atomic step segments, then there exists some schedule z
satisfying R such that the inputs to σ under z do not come from a consistent state. We now give a formal
proof as follows.
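
The example schedule can be replayed in a few lines of Python. This is a sketch under assumptions: the
names `t1`, `t2`, `t31` and `t32` stand for the two writing transactions and the two reading steps of the
third transaction, and the dictionary `db` plays the role of the two-object atomic data set.

```python
CONSISTENT = {(0, 1), (1, 0)}      # the consistent states of the ADS, as (A, B)
db = {"A": 0, "B": 1}              # start from a consistent state
seen = {}                          # values observed by the reading transaction


def t1(db):
    db.update(A=0, B=1)            # first writer assigns (0, 1)


def t2(db):
    db.update(A=1, B=0)            # second writer assigns (1, 0)


def t31(db):
    seen["A"] = db["A"]            # first step of the reader inputs A


def t32(db):
    seen["B"] = db["B"]            # second step of the reader inputs B


# Schedule from the example: writer, read A, other writer, read B.
for step in (t1, t31, t2, t32):
    step(db)

view = (seen["A"], seen["B"])
final_state = (db["A"], db["B"])
```

The database itself is consistent after every writer, yet the pair of values the reader collected is
(0, 0), which is not a consistent state; that is the situation the formal proof rules out by forbidding
the split of a transaction ADS segment.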

Lemma 4.5-1: Let A = {O_1, ..., O_n} be an ADS from the maximal consistency preserving partition of D.
Let the index set of A, I, be partitioned into two sets S_1 and S_2. Let U be the universal set of the consistent
states of A. We have

    ∃(W ∈ U ∧ X ∈ U ∧ Y ∉ U) ((π_{S_1}(W) = π_{S_1}(Y)) ∧ (π_{S_2}(X) = π_{S_2}(Y)))

Proof: Suppose that the claim is false. By Definition 3.1 (consistency preserving partition), {S_1, S_2} is a
consistency preserving partition of I. This contradicts the assumption that D is maximally partitioned. □
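
For a small finite ADS, the witnesses W, X and Y of Lemma 4.5-1 can be found by exhaustive search. The
Python sketch below is illustrative only: the function name `find_witness`, the encoding of states as
tuples, and the binary value domain are assumptions made for the example.

```python
from itertools import product


def find_witness(U, S1, S2, domain=(0, 1)):
    """Search for (W, X, Y) with W, X in U and Y not in U such that Y agrees
    with W on the indices in S1 and with X on the indices in S2."""
    n = len(S1) + len(S2)

    def proj(state, idx):
        return tuple(state[i] for i in idx)

    for W, X in product(sorted(U), repeat=2):
        for Y in product(domain, repeat=n):
            if (Y not in U
                    and proj(Y, S1) == proj(W, S1)
                    and proj(Y, S2) == proj(X, S2)):
                return W, X, Y
    return None


# ADS with two objects whose consistent states are (0, 1) and (1, 0).
witness = find_witness({(0, 1), (1, 0)}, S1=[0], S2=[1])

# If every combination is consistent, the two objects are independent and
# no witness exists: the set could be split further.
no_witness = find_witness({(0, 0), (0, 1), (1, 0), (1, 1)}, S1=[0], S2=[1])
```

The second call returns `None`, matching the contrapositive of the lemma: when no such Y exists, {S_1, S_2}
is itself consistency preserving and the ADS was not part of a maximal partition.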

Lemma 4.5-2: Let database D be maximally partitioned into atomic data sets. Let T_i be a consistent and
correct transaction operating upon D. Let σ be a transaction ADS segment in T_i. Let the ADS accessed by σ
be A. Suppose that σ is partitioned into atomic step segments σ_1, ..., σ_k, k > 1, by some modular scheduling
rule R. Then there exist a transaction system T ∈ 𝒯 and a schedule z ∈ Z_R(T) such that under z the values
input to σ are not projections from a consistent state of A.

Proof: Let T_i = {σ_1, σ_2, ..., σ_n} be the transaction ADS segments of T_i. Suppose that a modular
scheduling rule R partitions transaction ADS segment σ_1 into σ_{1,1}, ..., σ_{1,j}. In the following, we prove



4.5  Summary 

In this chapter, we have defined the setwise serializable scheduling rule and proved that this rule is the
optimal application independent scheduling rule. From an application point of view, once the atomic data
sets and consistency constraints are specified, the implementation of the setwise serializable scheduling rule is
straightforward. For example, we can modify the two phase lock protocol [Eswaran 76] into a setwise two phase
lock. A setwise two phase lock protocol requires a transaction not to release any lock on any data object of an
atomic data set until all the locks in this atomic data set have been acquired. Once any lock in an atomic data
set has been released, no more data objects in the same atomic data set can be locked. In many instances, data
objects in an atomic data set can be organized into a tree structure. In this case, the tree lock
protocol [Silberschatz 80] can be adopted to enforce setwise serializability. This protocol has the advantage
that it is free from the deadlock problem. This protocol requires that,

• except for the first item locked, no item can be locked unless a lock is currently held on its parent.

• no item is ever locked twice.

Note that the first item need not be the root, and the locking need not be two phase.
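
The setwise two phase lock rule described above can be sketched as a checker over one transaction's lock
trace. This is a minimal sketch under assumptions: the trace format, the function name `obeys_setwise_2pl`
and the sample ADS layout are hypothetical, and the sketch checks only the two phase discipline per atomic
data set, not lock compatibility between transactions.

```python
def obeys_setwise_2pl(ops, ads_of):
    """Check one transaction's lock trace against the setwise two phase rule.

    ops:    list of ("lock", obj) or ("unlock", obj) in execution order.
    ads_of: dict mapping each data object to the name of its ADS.
    """
    shrinking = set()   # ADSs in which this transaction has released a lock
    held = set()        # data objects currently locked
    for action, obj in ops:
        ads = ads_of[obj]
        if action == "lock":
            if ads in shrinking:
                return False   # locking in an ADS after releasing in that ADS
            held.add(obj)
        elif action == "unlock":
            if obj not in held:
                return False   # releasing a lock that is not held
            held.remove(obj)
            shrinking.add(ads)
        else:
            raise ValueError(f"unknown action: {action}")
    return True


# Hypothetical layout: objects a and b form ADS S1, object c forms ADS S2.
ads_of = {"a": "S1", "b": "S1", "c": "S2"}

# Locks in S2 are taken after S1 was fully released: allowed setwise,
# although it would violate the ordinary (database-wide) two phase rule.
good = [("lock", "a"), ("lock", "b"), ("unlock", "a"), ("unlock", "b"),
        ("lock", "c"), ("unlock", "c")]

# A lock in S1 is requested after a release in S1: not allowed.
bad = [("lock", "a"), ("unlock", "a"), ("lock", "b")]

good_ok = obeys_setwise_2pl(good, ads_of)
bad_ok = obeys_setwise_2pl(bad, ads_of)
```

Tracking the shrinking phase per ADS rather than per transaction is the only change relative to a checker
for the ordinary two phase rule, which is what allows locks to be released early once a transaction is
done with one atomic data set.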




that if j = 2 then the inputs to σ_1 are not projections of a consistent state of A. If j > 2, we merge σ_{1,2}, ..., σ_{1,j}
into a single atomic step segment σ_{1,2} and then use the result for j = 2. There can be two cases.

Case 1: Segments σ_{1,1} and σ_{1,2} access disjoint sets of data objects in ADS A. Let I be the index set of A.
Partition I into S_1 and S_2 such that σ_{1,i} accesses only data objects indexed by S_i, i = 1 to 2.
Let T_w and T_x be two transactions in T, assigning consistent states W and X to A respectively. Let W_1 =
π_{S_1}(W) and X_2 = π_{S_2}(X). Although states W and X are consistent, by Lemma 4.5-1 the state resulting from
the combination of W_1 and X_2, Y = <W_1, X_2>, could be an inconsistent state. We assume that this is the case.
Let z be a schedule satisfying R, in which T_w < σ_{1,1} < T_x < σ_{1,2}, where "T_w < σ_{1,1}" denotes that T_w precedes
σ_{1,1} in z. Note that the inputs to σ_1 are projections from the inconsistent state Y.

Case 2: Segments σ_{1,1} and σ_{1,2} share at least one data object in A. Let one of these shared data objects be
O. Let z be a schedule satisfying R in which T_w < σ_{1,1} < T_x < σ_{1,2}. Let the values of O assigned by T_w and T_x
be w and x respectively. Hence, under z there are two steps accessing O, inputting values from two different
states of A. It follows that R cannot guarantee that all the inputs to σ_1 are projections from a consistent state
of A. □

Theorem 4.5: Given a maximally CP partitioned database D, the setwise serializable scheduling rule is
optimal in the set of application independent scheduling rules for transactions operating upon database D.

Proof: Let the setwise serializable scheduling rule be R*. Let R be any other application independent
scheduling rule. Let T_i be a consistent and correct transaction. Let the database be maximally partitioned into
atomic data sets. Let σ be a transaction ADS segment of T_i accessing ADS A. We examine the partition of T_i
by R. If any ADS segment σ of T_i is partitioned, then by Lemma 4.5-2 the inputs to σ cannot be guaranteed
to be the projections of a consistent state of A. Since R identically partitions all the transactions equivalent to
T_i in syntax, we can interpret the semantics of T_i as follows. T_i produces correct results if and only if the input
values to σ are projections from a consistent state of A. Hence, if any transaction ADS segment is partitioned
by R, we can construct a consistent and correct transaction and a schedule z satisfying R such that under z,
T_i produces incorrect results. Thus, R is incorrect. It follows that the atomic step segments of T_i specified by
R can only be either transaction ADS segments or the supersets of them. Therefore, any schedule z satisfying
R satisfies R*. Hence, R* is at least as concurrent as R. □




5.  Compound  Transactions 


In the previous chapter, we studied the optimal application independent scheduling rule, the setwise
serializable scheduling rule. We now search for an even higher degree of concurrency in modular concurrency
control by removing the requirement of application independence. This objective is achieved by developing
a new transaction structure called compound transactions, and its associated scheduling rules called
generalized setwise serializable scheduling rules. The structure of a compound transaction is designed to
utilize the semantic information of a given transaction by partitioning its steps into a partially ordered set of
elementary transactions. The syntax of each elementary transaction can be in the form of either a single level
or a nested transaction. A compound transaction becomes a nested transaction when it consists of only a
single elementary transaction.

Conceptually, while atomic data sets result from a consistency preserving partition of the database, elementary
transactions result from a correctness preserving partition of a transaction. Given a database and its
associated consistency constraints, we partition the database into consistency preserving atomic data sets. As
long as the consistency of each atomic data set is satisfied individually, the consistency of the database is also
satisfied. Given a transaction and its associated post-conditions, we partition the transaction into correctness
preserving elementary transactions. As long as the post-condition of each elementary transaction is
individually satisfied, the post-conditions of the compound transaction are also satisfied. More precisely, a
compound transaction consists of a partially ordered set of elementary transactions. When an elementary
transaction is executed serializably and in an order consistent with the partial ordering of the elementary
transactions in the compound transaction, it has the following two properties. First, given a consistent state of
the database and the results passed from preceding elementary transactions, an elementary transaction
produces another consistent state of the database and satisfies its own post-conditions. Second, the conjunction
of the post-conditions of the constituent elementary transactions is equivalent to the post-conditions of
the compound transaction. A simple example of a compound transaction was given in the overview of this
thesis. A more elaborate example is given in Section 7.3, Writing Compound Transactions.


We now turn to the subject of scheduling compound transactions. Given a transaction, the generalized
setwise serializable rules call for the following two steps. First, we need to express the transaction in the form
of a compound transaction. Second, we partition each of the elementary transactions in the compound
transaction into transaction ADS segments. We have proved that generalized setwise serializable scheduling
rules are consistent, correct and modular. In addition, these rules form a complete class within the set of all
the possible modular scheduling rules. This means that for any given modular scheduling rule, there exists a
generalized setwise serializable scheduling rule that provides at least as much concurrency.

This chapter is organized as follows. In Section 5.1, we define the syntax of compound transactions and
state our assumptions about them. In Section 5.2, we define generalized setwise serializable scheduling rules
and show that they are consistent, correct and modular. In addition we show that generalized setwise serializable
scheduling rules form a complete class within the set of all the consistent and correct modular scheduling
rules. Section 5.3 is the summary of this chapter.


5.1  Syntax 


We begin our formal investigation of compound transactions by first defining their syntax. A compound
transaction consists of a partially ordered set of elementary transactions. Each elementary transaction can
have the structure of a nested transaction. These elementary transactions collectively carry out the task of the
compound transaction by passing information via local variables. For simplicity, we assume that all elementary
transactions are single level transactions in the following discussion. The extension of the results to
elementary transactions in the form of nested transactions is straightforward. It can be done via the concept
of linearization, similar to the procedure used in Section 4.3.

Definition  5.1:  A  compound  transaction  is  a  partially  ordered  set  of  elementary  transactions. 

Definition 5.2: Let T^e_i = {t_{i,1}, ..., t_{i,n}} be an elementary transaction. Let the data object accessed by step
t_{i,j} ∈ T^e_i be O_{i,j}. Let the local variable associated with step t_{i,j} be L_{i,j}. Step t_{i,j} is modelled by the indivisible
operation of the following two instructions.

    t_{i,j}:  L_{i,j} := O_{i,j}
              O_{i,j} := f_{i,j}(L_P, L_{i,1}, ..., L_{i,j})

where L_P is the set of local variables associated with all the preceding elementary transactions in the same
compound transaction.




We now define the concepts of input and output steps of an elementary transaction. Next, we define the
concept of the post-conditions of an elementary transaction.

Definition 5.3: Step t_{i,j} ∈ T^e_i is an input step if it is the step in T^e_i first accessing a data object O. Step t_{i,j} is an
output step if it is the step in T^e_i last accessing O.

Definition 5.4: Let O_j[v_I], j = 1 to k, be the values input to the input steps of T^e_i and O_j[v_O], j = 1 to k, be
the values output by the output steps of T^e_i. The post-condition of T^e_i is a specification of the output values as
functions of input values and the values of local variables associated with the steps in the preceding elementary
transactions.

    O_j[v_O] = f_j(L_P, O_1[v_I], ..., O_k[v_I]),  j = 1 to k

where L_P is the set of local variables associated with the steps in the preceding elementary transactions of
the same compound transaction.

Having  defined  the  syntax  of  compound  transactions,  we  now  state  our  assumptions  about  them. 

Assumption 5.1: When an elementary transaction of a compound transaction is executed serially and in an
order that is consistent with the partial order defined by the compound transaction, it satisfies our three
fundamental assumptions about a transaction, that is, it terminates (A1), preserves the consistency of the
database (A2) and satisfies its own post-conditions (A3). Furthermore, the post-condition of a compound
transaction is equivalent to the conjunction of all the post-conditions of its elementary transactions.
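
Assumption 5.1 can be pictured with a small Python sketch. Everything named here is hypothetical: the
helper `run_compound`, the `withdraw` and `deposit` elementary transactions, and their post-conditions are
invented for illustration. The sketch runs the elementary transactions serially in an order consistent
with their partial order (using the standard library's `graphlib`, available from Python 3.9) and takes the
conjunction of their post-conditions.

```python
from graphlib import TopologicalSorter


def run_compound(elementary, predecessors, db):
    """Run elementary transactions serially in an order consistent with the
    partial order and return the conjunction of their post-conditions.

    elementary:   name -> (body, post); body(db, passed) mutates db and returns
                  a result passed on to later elementary transactions;
                  post(db, passed, result) is that transaction's post-condition.
    predecessors: name -> set of names that must run earlier.
    """
    passed, ok = {}, True
    for name in TopologicalSorter(predecessors).static_order():
        body, post = elementary[name]
        result = body(db, passed)
        ok = ok and post(db, passed, result)
        passed[name] = result
    return ok


def withdraw(db, passed):
    amount = min(10, db["a"])
    db["a"] -= amount
    return amount                      # handed to the following transaction


def deposit(db, passed):
    db["b"] += passed["withdraw"]      # uses the result of its predecessor
    return db["b"]


elementary = {
    "withdraw": (withdraw, lambda db, passed, r: 0 <= r <= 10 and db["a"] >= 0),
    "deposit": (deposit, lambda db, passed, r: r == db["b"]),
}
predecessors = {"withdraw": set(), "deposit": {"withdraw"}}

db = {"a": 30, "b": 70}
all_posts_hold = run_compound(elementary, predecessors, db)
```

The dictionary `passed` stands in for the local variables that preceding elementary transactions make
available to later ones; the returned flag corresponds to the post-condition of the compound transaction
under the assumption that it equals the conjunction of the elementary post-conditions.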

5.2  Generalized  Setwise  Serializable  Scheduling  Rules 

Having defined the syntax of a compound transaction, we now define the concept of generalized setwise
serializable scheduling rules.

Definition 5.5: Let T = {T_1, ..., T_n} be a transaction system. The generalized setwise serializable scheduling
rule R is a modular scheduling rule with the following properties.

1. R(T_1, ..., T_n) = (R^k_1(T_1), ..., R^k_n(T_n))

2. The kernel of R, R^k_i, i = 1 to n, is a composite function which first maps T_i into a compound
transaction T^c_i, and then partitions the steps of each of the elementary transactions of T^c_i into
transaction ADS segments.

For simplicity, we refer to the kernels of the generalized setwise serializable scheduling rule as generalized
setwise serializable scheduling rules.

Definition 5.6: A schedule z for a transaction system T is said to be generalized setwise serializable if z
satisfies the generalized setwise serializable scheduling rule R. That is, the transaction ADS segments
specified by R in one elementary transaction are interleaved serializably with those of others in z.

Theorem 5.1: Generalized setwise serializable schedules are consistent and correct.

Proof: Since a generalized setwise serializable schedule is setwise serializable with respect to all the
elementary transactions in the system, it follows from Assumption 5.1 and Corollary 4.2-1 that each elementary
transaction terminates, preserves the consistency of the database and produces results that satisfy its
post-conditions. Hence, the consistency of the database is preserved and the post-condition of each of the
compound transactions is also satisfied. Therefore, generalized setwise serializable schedules are consistent and
correct. □

Corollary 5.1: Generalized setwise serializable scheduling rules are consistent and correct.

Theorem 5.2: Generalized setwise serializable scheduling rules are modular.

Proof:  It  directly  follows  from  Definition  5.5.  □ 

Thus far, we have shown that the generalized setwise serializable scheduling rule R is consistent, correct
and modular. We now prove that generalized setwise scheduling rules (i.e. the kernels of R) form a complete
class within the set of all the consistent and correct modular scheduling rules. Before proceeding with the
proof of completeness, we need to introduce the concept of the post-conditions associated with an atomic step
segment. To illustrate the need, let the transaction $T_i = \{t_{i,1}: A := A - 1;\ t_{i,2}: A := A + 1\}$. If the two steps
of $T_i$ are treated as a single atomic step segment, then $t_{i,1}$ is an input step and $t_{i,2}$ is an output step. The
partition of a transaction could create input and output steps in addition to those defined in an executing
alone environment. For example, if each of these two steps is an atomic step segment, then $t_{i,1}$ ($t_{i,2}$) is both an
input step and an output step.




Definition 5.7: Let $\sigma = \{t_{i,1}, \ldots, t_{i,k}\}$ be an atomic step segment. Let the data object accessed by step
$t_{i,j} \in \sigma$ be O. Step $t_{i,j}$ is an input step if it is the step in $\sigma$ first accessing O. Step $t_{i,j}$ is an output
step if it is the step in $\sigma$ last accessing O.
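
As an illustration of Definition 5.7, the following minimal sketch classifies the steps of an atomic step segment. The encoding of a step as a `(step_name, data_object)` pair and the function name `input_output_steps` are assumptions made for this example only; they are not notation from this report.

```python
from typing import Dict, List, Set, Tuple

# A step is modelled as (step_name, data_object). This tuple encoding is an
# assumption made for illustration only.
Step = Tuple[str, str]


def input_output_steps(segment: List[Step]) -> Tuple[Set[str], Set[str]]:
    """Return (input steps, output steps) of one atomic step segment."""
    first: Dict[str, str] = {}
    last: Dict[str, str] = {}
    for name, obj in segment:
        first.setdefault(obj, name)  # keep only the first step accessing obj
        last[obj] = name             # overwritten until the last access
    return set(first.values()), set(last.values())


# Both steps treated as one segment: t1 is the input step, t2 the output step.
whole = [("t1", "A"), ("t2", "A")]
# One step treated as its own segment: it is both an input and an output step.
single = [("t1", "A")]
```

Applied to `whole`, the sketch reports `t1` as the only input step and `t2` as the only output step; applied to `single`, the same step appears in both sets, which mirrors the two cases discussed before Definition 5.7.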


Definition 5.8: Let $O_j[v_j^i]$, j = 1 to k, be the values input to the input steps of $\sigma$ and $O_j[v_j^o]$, j = 1 to k, be
the values output by the output steps of $\sigma$. The post-condition of $\sigma$ is a specification of the output values as
functions of input values and the values of the local variables associated with the steps of preceding atomic
step segments of the same transaction. That is,

$O_j[v_j^o] = f_j(L_j, O_1[v_1^i], \ldots, O_k[v_k^i])$, j = 1 to k;

where $L_j$ is the set of local variables associated with the steps of the atomic step segments preceding the
output step for $O_j$.

We  now  prove  that  generalized  setwise  serializable  scheduling  rules  form  a  complete  class  within  the  set  of 
all  the  consistent  and  correct  modular  scheduling  rules. 


Lemma  5.3:  Let  R  be  a  modular  scheduling  rule.  R  is  consistent  and  correct  if  and  only  if 

1. for each of the transactions $T_i$ in $\mathcal{T}$, the conjunction of the post-conditions of all the atomic step
segments in $T_i$ is equivalent to the post-condition associated with $T_i$.

2. let z be a schedule for a transaction system $T \subset \mathcal{T}$ such that z satisfies R. In the execution of T
according to z, any atomic step segment in $T_i \in T$ specified by R preserves the consistency of the
database.


Proof: First, if any atomic step segment $\sigma$ in $T_i$ specified by R does not preserve the consistency of the
database, then another transaction $T_j$ executing after $\sigma$ would input an inconsistent state. Since R is modular,
we can define the semantics of $T_j$ as one that outputs incorrect results when its input is inconsistent. Thus R is
incorrect. Second, if the conjunction of the post-conditions of all the atomic step segments in $T_i$ specified by
R is not equivalent to the post-condition associated with $T_i$, then R is incorrect by definition. Conversely, since any
schedule z for any transaction system $T \subset \mathcal{T}$ satisfying R guarantees the serializability in the execution of the
transaction atomic step segments specified by R, it follows from conditions 1 and 2 that z is consistent and
correct. □




Theorem 5.3: Generalized setwise serializable scheduling rules form a complete class within the set of
modular scheduling rules.

Proof: Let R be a consistent and correct modular scheduling rule. Suppose that $T_i$ is a consistent and
correct transaction and is partitioned into atomic step segments $\sigma_1, \ldots, \sigma_k$ by R. First, by Lemma 5.3, $\sigma_j$, j =
1 to k, must preserve the consistency of the database when executing alone. Second, by Lemma 5.3 the
conjunction of the post-conditions of $\sigma_1, \ldots, \sigma_k$ must be equivalent to the post-conditions associated with $T_i$.
Note that each $\sigma_j$, j = 1 to k, satisfies the definition of an elementary transaction. We now define a
generalized setwise serializable scheduling rule R* which partitions $T_i$ as follows. First, R* labels $\sigma_1, \ldots, \sigma_k$ as
elementary transactions. Next, R* partitions these elementary transactions into transaction ADS segments.
Hence, R* is at least as concurrent as R. □

5.3  Summary 

In this chapter, we have defined the concept of compound transactions and their associated scheduling
rules: the generalized setwise serializable scheduling rules. We have proved that these rules are consistent,
correct and modular. In addition, generalized setwise serializable scheduling rules form a complete class
within the set of all the consistent and correct modular scheduling rules.

From an application point of view, it is important to point out that the process of mapping a transaction
into the form of a compound transaction requires the understanding of the semantics of the transaction. We
have only provided the syntax of compound transactions to allow a user to accomplish this process. It is the
user's responsibility to prove that he has correctly expressed his transaction in the form of a compound
transaction. The scheduling of compound transactions is, however, straightforward. We can use either
setwise two phase lock or tree lock to ensure that each elementary transaction in a compound transaction is
executed setwise serializably.

From a programmer's point of view, the implication of our completeness result, Theorem 5.3, is that there
exists a way of writing compound transactions such that we can provide at least as much concurrency as any
other modular concurrency control method. Unfortunately, this is only an existence theorem. In order to
achieve a high degree of concurrency, we must try to make a compound transaction consist of many small
elementary transactions rather than a few large ones. Some of the ideas in writing and scheduling a
compound transaction are illustrated by an example in Section 7.3, Writing Compound Transactions.




6.  The  Failure  Safe  Rule 


In the previous chapters, we have studied the subject of concurrency control without considering the
impact of possible system failures. We now extend our work by taking the effect of system failures into
consideration. Computer systems can fail in many ways. We limit ourselves to the so-called clean and soft
failures that are traditionally considered by transaction system failure recovery management [Bernstein 83]. A
failure is said to be clean and soft if the effects of this failure can be modelled by a network of computers
some of which stop running and lose the content of their main memories. However, the database residing in
the stable storage remains intact. The stable storage is an idealized model of a network of system disks or
other reliable storage devices whose probability of failure is so small that such failures can be ignored by the
applications in question.

The objective of using failure recovery rules is to ensure that concurrency control can be carried out
consistently and correctly despite failures in the system. A failure recovery rule specifies the conditions under
which the executed steps of a transaction commit or abort. Committing a sequence of executed steps means
non-divisibly transferring the result of computations from the main memory to the database residing in the
stable storage. The commit operation represents irrevocable changes made to the database. For example, the
recovery rule associated with serializable schedules is known as failure atomicity. This rule requires that a
transaction either commits or aborts all executed steps at the end of execution. The non-divisibility for
committing all the executed steps of a transaction is a main cause of low concurrency. In order to ensure
failure atomicity and avoid cascaded aborts, it was shown by Wood [Wood 80] that none of the locks held by a
transaction can be released until the transaction has successfully committed. This is because a transaction can
be aborted by failures before it has committed. To release a lock before a transaction has committed could
allow other transactions to use the results of a partial computation that could be invalidated by a later abort
operation. Cascaded abort is generally regarded as a very undesirable event and should be avoided. In this
work, we consider only recovery rules for which cascaded aborts do not occur.

In this chapter, we develop a new failure recovery rule, the failure safe rule, which is designed for
transaction systems using generalized setwise serializable schedules. Given a compound transaction, the failure
safe rule examines the transaction ADS segments in each of the elementary transactions. If the steps of a
transaction ADS segment are not interleaved with those of others, then the ADS segment is named an atomic
commit segment. Otherwise, interleaved ADS segments are taken to be a single atomic commit segment. That
is, the failure safe rule partitions a transaction into a partially ordered set of atomic commit segments. A
transaction must non-divisibly commit each of its atomic commit segments in an order consistent with the
partial order of commit segments in the transaction. As an example, the compound transaction Get-A-and-B
discussed in the overview of this thesis has four atomic commit segments, each of which corresponds to one of
the four elementary transactions: Get-A, Get-B, Put-Back-A and Put-Back-B, because each of these
elementary transactions consists of a single transaction ADS segment. It should be pointed out that failure atomicity
is a degenerate case of our failure safe rule in that it corresponds to the rule in which the entire set of steps of
a transaction is taken as a single atomic commit segment.
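
The grouping step of the failure safe rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's algorithm: the steps of one elementary transaction are assumed to be given in a single linear order, each step is tagged with a hypothetical ADS segment label, and two ADS segments are treated as interleaved when their first-to-last step ranges overlap. The function name `atomic_commit_segments` is invented for this example.

```python
from typing import Dict, List


def atomic_commit_segments(step_owner: List[str]) -> List[List[str]]:
    """Group ADS segments into atomic commit segments.

    step_owner[i] is the (hypothetical) label of the ADS segment that owns
    the i-th step of one elementary transaction, in execution order.
    """
    # Record the first and last step position of every ADS segment.
    spans: Dict[str, List[int]] = {}
    for pos, seg in enumerate(step_owner):
        spans.setdefault(seg, [pos, pos])[1] = pos

    groups: List[List[str]] = []
    group_end = -1
    # Sweep the segments by first position; overlapping ranges are interleaved.
    for seg, (start, end) in sorted(spans.items(), key=lambda kv: kv[1][0]):
        if groups and start < group_end:
            groups[-1].append(seg)          # interleaved: same commit segment
            group_end = max(group_end, end)
        else:
            groups.append([seg])            # not interleaved: its own segment
            group_end = end
    return groups
```

With the step labelling `["S1", "S1", "S2", "S2"]` no segments interleave, so each ADS segment becomes its own atomic commit segment. With `["S1", "S2", "S1", "S3"]` the segments `S1` and `S2` interleave and are merged into one atomic commit segment, while `S3` stays separate.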

From a concurrency control point of view, a concurrent execution of a transaction system is modelled by a
schedule. It is shown in this chapter that owing to the preservation of both the internal ordering of steps in
transactions and the integrity of transaction ADS segments, the effect of a system failure under the failure safe
rule becomes equivalent to changing an existing generalized setwise serializable schedule to another schedule
that is also generalized setwise serializable. Thus, under the failure safe rule system failures are safe in the
sense that the computations recorded in the database are equivalent to the computations resulting from a
failure-free execution of the transaction system.

In this chapter, our study of failure recovery rules is organized in parallel with our study of
concurrency control rules in the previous chapters. In Section 6.1, we develop a model of failure recovery
rules and define their properties such as consistency, correctness, failure safety and optimality. In Section 6.2
we define the failure safe rule and prove that this rule is consistent, correct and modular. In addition, we
prove that this rule is optimal for transaction systems scheduled by generalized setwise serializable scheduling
rules. In Section 6.3, we summarize this chapter.

6.1  A  Model  of  Modular  Failure  Recovery  Rules 

We  now  develop  our  model  of  modular  failure  recovery  rules.  We  first  define  the  concept  of  a  modular 
failure  recovery  rule.  Second,  we  model  the  effect  of  clean  and  soft  failures  upon  concurrency  control.  We 
then  define  the  concept  of  consistency  and  correctness  for  a  modular  failure  recovery  rule  in  the  face  of  clean 
and  soft  failures.  Next,  we  introduce  the  concept  of  safety  for  a  recovery  rule.  We  conclude  this  section  by 
addressing  the  issues  of  re-scheduling  aborted  transactions. 


A failure recovery rule is a function which takes a transaction and partitions it into equivalence classes called
atomic commit segments. At the end of executing an atomic commit segment, all executed steps of this
segment must be either committed or aborted as an indivisible unit. A failure recovery rule is said to be
modular if it partitions a transaction independently of other transactions in the system. For example, failure
atomicity is a modular failure recovery rule that takes the entire set of steps of a transaction as a single atomic
commit segment. We now formalize our concept of a failure recovery rule in Definitions 6.1-1 and 6.1-2.

Definition 6.1-1: Let $\mathcal{T}_m$ denote the set of all the possible consistent and correct transactions with m steps.
Let $\mathcal{T}$ denote the set of all the consistent and correct transactions, that is, $\mathcal{T} = \bigcup_{m=1}^{\infty} \mathcal{T}_m$. Let P denote a
partition of an m-step consistent and correct transaction into a partially ordered set of transaction step
segments called atomic commit segments. Let $\mathcal{P}_m$ denote the set of all the possible sets of partitions of an
m-step consistent and correct transaction. Let $\mathcal{P}$ be the set of all the possible partially ordered sets of
partitions, that is, $\mathcal{P} = \bigcup_{m=1}^{\infty} \mathcal{P}_m$. A failure recovery rule for a transaction system with n transactions,
$\mathcal{R}_n$, is a function which takes the transaction system of size n and partitions each of the transactions into
a partially ordered set of atomic commit segments.

Definition 6.1-2: A failure recovery rule $\mathcal{R}$ is a function which takes a transaction system of any size and
partitions each of the transactions into a partially ordered set of commit segments,
such that the restriction of $\mathcal{R}$ to $\prod_{i=1}^{n} \mathcal{T}$ is $\mathcal{R}_n$, i.e.

$\mathcal{R} \mid \prod_{i=1}^{n} \mathcal{T} = \mathcal{R}_n$, n = 1 to $\infty$.

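Definition 6.2: A failure recovery rule $\mathcal{R}$ is said to be modular if and only if $\mathcal{R}$ independently partitions
each transaction in a transaction system into atomic commit segments. That is,

$\mathcal{R}(T_1, \ldots, T_n) = (\mathcal{R}_1(T_1), \ldots, \mathcal{R}_1(T_n))$, n = 1 to $\infty$,

where $\mathcal{R}_1$ is the restriction of $\mathcal{R}$ to single-transaction systems, $\prod_{i=1}^{1} \mathcal{T}$.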
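
The modularity condition of Definition 6.2 can be pictured with a small sketch. It assumes, purely for illustration, that a transaction is a list of step names and that a partition is an ordered list of atomic commit segments (the report allows a partial order). The names `failure_atomicity` and `modular_rule` are hypothetical.

```python
from typing import Callable, List, Sequence

Transaction = Sequence[str]      # assumed encoding: ordered list of step names
Partition = List[List[str]]      # assumed encoding: list of atomic commit segments
Kernel = Callable[[Transaction], Partition]


def failure_atomicity(transaction: Transaction) -> Partition:
    """Kernel that takes the whole transaction as one atomic commit segment."""
    return [list(transaction)]


def modular_rule(kernel: Kernel) -> Callable[[Sequence[Transaction]], List[Partition]]:
    """Lift a single-transaction kernel to transaction systems of any size."""
    def rule(system: Sequence[Transaction]) -> List[Partition]:
        # Each transaction is partitioned independently of the others.
        return [kernel(t) for t in system]
    return rule


rule = modular_rule(failure_atomicity)
partitions = rule([["t1", "t2"], ["u1"]])
```

Because `rule` applies the kernel to each transaction separately, the partition of a transaction does not change when other transactions are added to or removed from the system, which is the property the definition asks for.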




Having formalized the concept of modular recovery rules, we now state our assumption regarding the
atomic commit segments produced by those recovery rules. When failure atomicity is adopted, the commit
operation ensures that the computations produced by a transaction are either discarded or transferred to the
database as a non-divisible unit. In this case, the values stored in the local variables are irrelevant to the
commit operation, because the computation has been completed. However, when we commit a transaction
segment by segment, we must not only guarantee that the computations produced by an atomic commit
segment are non-divisibly transferred to the database, but also guarantee that the values stored in the local
variables associated with the commit segment are non-divisibly transferred to the stable storage as well. This
is because when an aborted transaction resumes its execution, an executing step may need the values of the
local variables associated with those commit segments already committed. We now formalize our assumption
regarding the commit and abort operations of atomic commit segments.

Assumption 6.1-1: At the end of executing an atomic commit segment, the computations produced by this
segment will be either committed or aborted.


Assumption 6.1-2: An atomic commit segment $\sigma$ is said to be committed if and only if

1. the computations produced by this segment will be non-divisibly transferred to the database. That
is, let the data objects accessed by atomic commit segment $\sigma$ be $O_1, \ldots, O_k$ and the values output
by $\sigma$ be $O_1[v_1^o], \ldots, O_k[v_k^o]$. The values of the data objects in the database will be
$O_1[v_1^o], \ldots, O_k[v_k^o]$ if $\sigma$ is committed.

2. The values stored in the local variables associated with segment $\sigma$ will also be non-divisibly
transferred to the stable storage. That is, let $L_1, \ldots, L_m$ be the set of local variables associated with
the transaction steps in $\sigma$. Let $\ell_1, \ldots, \ell_m$ be a set of data objects in the stable storage but not part
of the database. Data objects $\ell_1, \ldots, \ell_m$ are said to be the private stable storage for $\sigma$. The private
stable storage of an atomic commit segment $\sigma$ can only be written by the commit operation of $\sigma$
and can only be read by steps of other commit segments in the same transaction. When $\sigma$ has
committed, we have $\ell_j = L_j$, j = 1 to m.

Assumption 6.1-3: An atomic commit segment is said to be aborted if and only if its computations are
discarded. That is,

1. Let $O_1[v_1^i], \ldots, O_k[v_k^i]$ be the values of data objects $O_1, \ldots, O_k$ input to atomic commit segment $\sigma$.
The values of data objects $O_1, \ldots, O_k$ remain $O_1[v_1^i], \ldots, O_k[v_k^i]$ respectively.

2. The private stable storage associated with $\sigma$ has the initial value "nil" for each of the data objects
$\ell_1, \ldots, \ell_m$. The value "nil" is a special value reserved for recovery management, such as the bit
pattern of a word with all 1's. "nil" must not be in the domain of any data object. When the
private stable storage objects $\ell_1, \ldots, \ell_m$ are assigned to a commit segment, their values are
initialized to "nil". This indicates that they have not been used to store values by the commit segment.


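Assumption 6.1-4: When resuming the execution of an aborted atomic commit segment $\sigma \in T_i$, all the local
variables associated with committed atomic commit segments in $T_i$ will be restored to values saved in their
private stable storages. That is, let $\sigma_1, \ldots, \sigma_k$ be the atomic commit segments that have been committed. We
have

$\forall (L \in \sigma_j,\ 1 \le j \le k)\ (L = \ell)$

where $\ell$ denotes the private stable storage object holding the committed value of local variable $L$.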
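
Assumptions 6.1-2 through 6.1-4 can be read as three operations on a commit segment. The sketch below is one possible rendering under simplifying assumptions: the database and the private stable storage are plain dictionaries, non-divisibility is not modelled beyond performing all updates in one call, and the class and function names (`CommitSegment`, `commit`, `abort`, `restore_locals`) as well as the `NIL` sentinel are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Sentinel standing in for "nil": a value outside the domain of every data object.
NIL = object()


class CommitSegment:
    """Bookkeeping for one atomic commit segment (illustrative only)."""

    def __init__(self, local_names: Iterable[str]) -> None:
        # Private stable storage starts out as "nil" for every local variable.
        self.private_stable: Dict[str, Any] = {n: NIL for n in local_names}
        self.committed = False


def commit(seg: CommitSegment, database: Dict[str, Any],
           outputs: Dict[str, Any], local_values: Dict[str, Any]) -> None:
    """Transfer output values to the database and locals to private stable storage."""
    database.update(outputs)
    for name in seg.private_stable:
        seg.private_stable[name] = local_values[name]
    seg.committed = True


def abort(seg: CommitSegment) -> None:
    """Discard the segment's computations: the database is left untouched."""
    for name in seg.private_stable:
        seg.private_stable[name] = NIL
    seg.committed = False


def restore_locals(segments: List[CommitSegment]) -> Dict[str, Any]:
    """Rebuild local variables from the private storage of committed segments."""
    restored: Dict[str, Any] = {}
    for seg in segments:
        if seg.committed:
            restored.update(seg.private_stable)
    return restored
```

In this reading, an aborted segment never writes to the database because its outputs only ever existed in main memory, and a resumed transaction calls `restore_locals` to recover the local variables of the segments it had already committed.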


Having defined modular recovery rules and the commit operations for atomic commit segments, we now
model the effect of a clean and soft system failure upon concurrency control. When a transaction system is
executed according to a schedule z and a failure occurs, schedule z is partitioned into two parts. The part $z^e$
represents the steps in z that have been executed; the other part $z^u$ represents those steps yet to be executed.^1

Definition 6.3: Given a schedule z for transaction system T, a clean and soft system failure partitions z into
two parts: $z^e$ and $z^u$. Partial schedule $z^e$ represents steps of z that have been executed prior to the failure and
$z^u$ represents steps in z yet to be executed.

Within an executed partial schedule $z^e$, there can be some atomic commit segments which have been
committed. The computations represented by successfully committed atomic commit segments are modelled
by a committed partial schedule $z^c$.

Before we formally define this concept, we need first to introduce a useful concept called sub-segment,
denoted as "$\subset$". We say that a sequence of steps $\sigma'$ is a sub-segment of another sequence of steps $\sigma$ if and
only if all the steps in $\sigma'$ are also in $\sigma$. In addition, the ordering of steps in $\sigma'$ is consistent with that in $\sigma$.

Definition 6.4: Let $\sigma$ be a sequence of transaction steps. A sequence of steps $\sigma'$ is said to be a sub-segment
of $\sigma$, denoted as $\sigma' \subset \sigma$, if and only if

$\forall ((t_i, t_j \in \sigma',\ i \ne j) \wedge (t_i > t_j))\ ((t_i, t_j \in \sigma) \wedge (t_i > t_j))$

^1 It is important to recall that each step is assumed to be executed atomically. Thus, we do not need to deal
with a "partially executed" step in our model.
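
The sub-segment relation of Definition 6.4 amounts to an order-preserving containment test. A minimal sketch, assuming steps are represented by unique names in lists and using a hypothetical function name `is_subsegment`:

```python
from typing import Sequence


def is_subsegment(sub: Sequence[str], seq: Sequence[str]) -> bool:
    """True if every step of sub occurs in seq, in the same relative order."""
    remaining = iter(seq)
    # "step in remaining" advances the iterator, so later steps of sub must be
    # found after the position where the earlier ones were matched.
    return all(step in remaining for step in sub)
```

For example, `["t1", "t3"]` is a sub-segment of `["t1", "t2", "t3"]`, whereas `["t3", "t1"]` is not, because the ordering of its steps is not consistent with the longer sequence.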


We now define the concept of a committed partial schedule.

Definition 6.5: Let $z^e$ be the executed partial schedule in z. Let $\sigma$ be an atomic commit segment. The
committed partial schedule $z^c$ is a sub-segment of $z^e$ that contains only successfully committed atomic commit
segments. That is,

1. The committed partial schedule is a sub-segment of the executed partial schedule.
$z^c \subset z^e$;

2. The committed partial schedule does not contain fragmented atomic commit segments.
$\forall (t \in z^c)\ \exists (\sigma \subset z^c)\ (t \in \sigma)$

3. Transaction commit segments belonging to a transaction are committed in the order defined by
the commit rule. We let "$\sigma_j < \sigma_k$" denote that commit segment $\sigma_j$ precedes $\sigma_k$.

$\forall (T_i \in T)\ \forall ((\sigma_j, \sigma_k \subset T_i) \wedge (\sigma_j < \sigma_k))\ ((\sigma_k \subset z^c) \to (\sigma_j \subset z^c))$
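
The three conditions of Definition 6.5 can be checked mechanically. The sketch below is an illustration under simplifying assumptions: schedules are lists of unique step names, every step belongs to exactly one commit segment, and the commit segments of each transaction are given as a list in commit order (a total order, whereas the report allows a partial order). The names `is_committed_partial_schedule` and `commit_segments` are hypothetical.

```python
from typing import Dict, List, Sequence


def is_subsegment(sub: Sequence[str], seq: Sequence[str]) -> bool:
    """True if every step of sub occurs in seq, in the same relative order."""
    remaining = iter(seq)
    return all(step in remaining for step in sub)


def is_committed_partial_schedule(
    committed: Sequence[str],
    executed: Sequence[str],
    commit_segments: Dict[str, List[List[str]]],
) -> bool:
    """Check conditions 1 to 3 for a candidate committed partial schedule."""
    # Condition 1: the committed part is a sub-segment of the executed part.
    if not is_subsegment(committed, executed):
        return False
    done = set(committed)
    for segments in commit_segments.values():
        earlier_missing = False
        for seg in segments:
            present = [step in done for step in seg]
            # Condition 2: no fragmented atomic commit segment.
            if any(present) and not all(present):
                return False
            if all(present):
                # Condition 3: a segment may commit only after its predecessors.
                if earlier_missing:
                    return False
            else:
                earlier_missing = True
    return True


segments = {"T1": [["a1", "a2"], ["a3"]], "T2": [["b1"]]}
executed = ["a1", "b1", "a2", "a3"]
```

With these sample values, `["a1", "b1", "a2"]` is accepted, `["a1", "b1"]` is rejected because the first segment of `T1` is fragmented, and `["b1", "a3"]` is rejected because the second segment of `T1` would be committed before the first.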

Having addressed the issues related to the commit operation, we now turn to the subject of resuming the
execution of an aborted transaction. When a transaction fails, some of its atomic commit segments may have
been already committed. When the database system resumes the execution of this aborted transaction, we are
facing the problem of scheduling a partial transaction. We assume that a partial transaction is scheduled by
the same scheduling rule R as follows. When a partial transaction consists of an integer number of atomic
step segments, all we have to do is to ensure that these segments are interleaved serializably with those of
other transactions. The main problem in scheduling a partial transaction is that an atomic step segment
specified by R might be partitioned by a recovery rule $\mathcal{R}$ into more than one atomic commit segment. When
a failure occurs, it is possible that only some of these commit segments have been committed. In this case, the
portion of an atomic step segment left in the remaining schedule $z_1$ will be taken as an atomic step segment
and interleaved serializably with atomic step segments of other transactions in $z_1$.

For example, suppose that transaction $T_i$ has two atomic step segments $\sigma_1$ and $\sigma_2$ specified by scheduling
rule R. If the entire $T_i$ is in a remaining partial schedule $z_1$, then these two atomic step segments will be
interleaved serializably with atomic step segments of other transactions in $z_1$. If $\sigma_1$ is committed but $\sigma_2$ is left
in $z_1$, then $\sigma_2$ will be interleaved serializably with those of others in $z_1$. Finally, suppose that a recovery rule
partitions the atomic step segment $\sigma_1$ into two atomic commit segments: $\sigma_{1,1}$ and $\sigma_{1,2}$. If only $\sigma_{1,1}$ has been
committed and $\sigma_{1,2}$ and $\sigma_2$ are left in $z_1$, then both $\sigma_{1,2}$ and $\sigma_2$ will be taken as atomic step segments in $z_1$.
That is, $\sigma_{1,2}$ and $\sigma_2$ will be interleaved serializably with those of others in $z_1$. We formalize our discussion
about scheduling aborted transactions as Assumption 6.2.

Assumption 6.2: Let z be a schedule for transaction system T satisfying scheduling rule R. Let $z_1$ be the
remaining partial schedule after a failure occurs during the execution of z. Let $\Sigma_R(T_i)$ denote the atomic step
segments of transaction $T_i$ specified by R.

1. The steps in $z_1$ are those in z but not in the committed partial schedule $z^c$.
$\forall (t)\ (((t \in z) \wedge (t \notin z^c)) \to (t \in z_1))$

2. If all the steps of a transaction $T_i$ are in $z_1$, then the atomic step segments of $T_i$ are still represented
by $\Sigma_R(T_i)$.

3. Let $T_i'$ denote the partial transaction of $T_i$ in the remaining partial schedule $z_1$. Partial transaction
$T_i'$ is a sequence of steps such that $(T_i' \subset T_i) \wedge (\forall (t)\ (((t \in T_i) \wedge (t \in z_1)) \to (t \in T_i')))$. Partial
transaction $T_i'$ is scheduled by R as follows. Let $\Sigma_R(T_i')$ denote the atomic step segments
specified by R with respect to the partial transaction $T_i'$. We have

a. $\forall (T_i' \subset z_1)\ \forall (\sigma \in \Sigma_R(T_i'))\ (\exists (\sigma' \in \Sigma_R(T_i))\ (\sigma \subset \sigma'))$

b. $\forall (\sigma' \in \Sigma_R(T_i))\ \forall ((t \in \sigma') \wedge (t \in T_i'))\ (\exists (\sigma \in \Sigma_R(T_i'))\ (t \in \sigma))$

4. The remaining partial schedule $z_1$ satisfies scheduling rule R. That is, the atomic step segments
specified by R in a (partial) transaction will be interleaved serializably with those of others in $z_1$.
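
Items 1 and 3 of Assumption 6.2 can be illustrated with a short sketch. It assumes steps are unique names, a schedule is a list of steps, and the atomic step segments of a transaction are lists of steps; the function names `remaining_steps` and `partial_segments` are hypothetical. The sketch only computes which steps and segment portions remain; the actual order of the remaining partial schedule is still chosen by the scheduling rule.

```python
from typing import List, Sequence


def remaining_steps(schedule: Sequence[str], committed: Sequence[str]) -> List[str]:
    """Steps of the schedule that are not in the committed partial schedule."""
    done = set(committed)
    return [step for step in schedule if step not in done]


def partial_segments(step_segments: Sequence[Sequence[str]],
                     remaining: Sequence[str]) -> List[List[str]]:
    """Atomic step segments of the partial transaction.

    Each original segment is cut down to its uncommitted steps; a segment
    whose steps were all committed disappears.
    """
    left = set(remaining)
    result: List[List[str]] = []
    for seg in step_segments:
        rest = [step for step in seg if step in left]
        if rest:
            result.append(rest)
    return result


# Segment 1 is (t1, t2, t3) and segment 2 is (t4, t5). A recovery rule split
# segment 1 into commit segments (t1, t2) and (t3); only (t1, t2) committed.
schedule = ["t1", "t2", "t3", "t4", "t5"]
left = remaining_steps(schedule, ["t1", "t2"])
```

Here `partial_segments([["t1", "t2", "t3"], ["t4", "t5"]], left)` yields the leftover portion `["t3"]` and the untouched segment `["t4", "t5"]`, matching the example in which the uncommitted part of a split atomic step segment is treated as an atomic step segment of its own.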

Having modelled the effect of a single failure, we now address the issue of multiple failures. When the first
failure occurs during the execution of z, the committed atomic commit segments are represented by $z_1^c$. The
executed but not yet committed transaction steps are aborted. These aborted steps are re-scheduled together
with the steps yet to be executed, $z^u$, to create the remaining partial schedule $z_1$. In the execution of $z_1$,
suppose that the second failure occurs. In this case, we have a new committed partial schedule, $z_2^c$, and a new
remaining schedule $z_2$. This process continues until all the steps in the transaction system are executed and
committed.

Definition 6.6-1: Let z be a schedule of transaction system T satisfying scheduling rule R. When the first
failure occurs, the committed partial schedule is denoted as $z_1^c$ and the remaining partial schedule is denoted
as $z_1$. The kth remaining partial schedule $z_k$ is the remaining schedule for $z_{k-1}$ after the kth failure.

Definition 6.6-2: When there are no failures, the execution of a transaction system is modelled by a
schedule z. With n failures, the execution of a transaction system is modelled by a committed schedule
$z(n) = \langle z_1^c, z_2^c, \ldots, z_{n+1}^c \rangle$.

In order to investigate the execution of transaction systems in the face of failures, we must relate the
concept of a committed schedule to our established results of scheduling rules.

Theorem 6.1: Let a committed schedule for a given total of k failures in the execution of transaction system
$T = \{T_1, \ldots, T_n\}$ be $z(k) = \langle z_1^c, z_2^c, \ldots, z_{k+1}^c \rangle$, where $z_i^c$ is the ith committed partial schedule after the
ith failure.^2 Committed schedule z(k) satisfies scheduling rule R if and only if

1. The ordering of the steps of transaction $T_i$, $1 \le i \le n$, in z(k) is consistent with that in transaction
$T_i$, $1 \le i \le n$.

$[\forall (t)\ ((t \in z(k)) \leftrightarrow (t \in \bigcup_{i} T_i))] \wedge [\forall (T_i \in T)\ \forall ((t_{i,j}, t_{i,k} \in T_i) \wedge (t_{i,k} > t_{i,j}))\ ((t_{i,j}, t_{i,k} \in z(k)) \wedge (t_{i,k} > t_{i,j}))]$

2. Atomic step segments specified by R and belonging to different transactions are interleaved
serializably in z(k).

$\exists (z \in Z(T))\ ((z(k) = z) \wedge (z \text{ is atomic step segment serial}))$


Proof: It directly follows from Assumptions 6.1-1, 6.1-2, 6.1-3 and 6.1-4, Definitions 6.6-1 and 6.6-2, and
Definitions 2.10 and 2.16. □


Having modelled the effect of failures upon concurrency control, we are now in a position to define the
concept of consistency and correctness of a recovery rule. We consider a recovery rule $\mathcal{R}$ designed for a
consistent and correct scheduling rule R to be consistent and correct if and only if $\mathcal{R}$ ensures the consistency
and correctness of the committed schedules in the face of system failures.


Definition 6.7: Let $\mathcal{R}$ be the recovery rule designed for the scheduling rule R. Recovery rule $\mathcal{R}$ is said to be
consistent and correct if and only if

$\forall (z \in Z_R(T))\ (z(n) \text{ is consistent and correct},\ 0 \le n < \infty)$

where $Z_R(T)$ is the set of all the schedules for transaction system T satisfying R.

^2 Note that $z_{k+1}^c = z_k$, because there are no more failures after the kth failure and the entire $z_k$ is committed.

We now define an important property of a recovery rule called safety. We consider a recovery rule $\mathcal{R}$
designed for scheduling rule R as being safe if and only if the committed schedules satisfy scheduling rule R.
That is, the computations recorded in the database cannot be distinguished from those resulting from an
execution of a schedule $z \in Z_R(T)$ without failures.

Definition 6.8: A recovery rule $\mathcal{R}$ designed for a scheduling rule R is said to be safe if and only if for any
given schedule $z \in Z_R(T)$ the committed schedules satisfy R, i.e.

$\forall (z \in Z_R(T))\ (z(n) \in Z_R(T),\ 0 \le n < \infty)$

Theorem 6.2: If recovery rule $\mathcal{R}$ designed for a consistent and correct scheduling rule R is safe, then
recovery rule $\mathcal{R}$ is consistent and correct.

Proof: Since all the schedules satisfying scheduling rule R are consistent and correct, it follows from
Definitions 6.7 and 6.8 that $\mathcal{R}$ is consistent and correct. □

We now turn to the subject of optimality of modular and safe recovery rules. We measure the concurrency
provided by a modular recovery rule by how finely it partitions a transaction. This is because computations
produced by an atomic commit segment must be withheld by locks or other mechanisms until this segment
has been committed. Allowing other transactions to use the results produced by a commit segment before its
commit could lead to cascaded aborts, a highly undesirable event. Generally, the finer the partition, the
smaller the size of a commit segment and thus the higher the degree of concurrency.

Definition 6.9: Let $\mathbf{T}$ be the set of all the consistent and correct transactions. Let $\mathbb{R}$ be the set of all the
modular and safe recovery rules associated with some consistent and correct modular scheduling rule R. Let
$\pi_{\mathcal{R}}(T_i)$ denote the partition resulting from $\mathcal{R} \in \mathbb{R}$ partitioning $T_i \in \mathbf{T}$. Modular and safe recovery rule $\mathcal{R}$ is
said to be optimal with respect to the associated scheduling rule R if and only if $\mathcal{R}$ always produces the most
refined partitions. That is, $\mathcal{R}$ is optimal if and only if

$$\forall (T_i \in \mathbf{T})\ \forall (\mathcal{R}' \in \mathbb{R})\ \big((\sigma \in \pi_{\mathcal{R}'}(T_i)) \rightarrow (\sigma \in \ldots)\big)$$

where $\sigma$ denotes an atomic commit segment.
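The notion of a "most refined" partition can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a partition is a list of commit segments and that each commit segment is a set of step identifiers; the function name `refines` and the sample data are hypothetical and do not come from the report.

```python
def refines(fine, coarse):
    """Return True if every segment of `fine` lies inside some segment of `coarse`."""
    return all(any(segment <= big for big in coarse) for segment in fine)


# Hypothetical transaction with steps 1..4, partitioned by two different rules.
fine_partition = [{1, 2}, {3}, {4}]
coarse_partition = [{1, 2}, {3, 4}]
```

Under this reading, an optimal rule is one whose partition refines the partition produced by every other modular and safe rule for the same transaction.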

We now conclude this section by addressing the issues in re-scheduling those aborted transactions. The
main point is that an atomic commit segment must be a superset of atomic step segments.

Theorem 6.3: An atomic commit segment produced by a modular and safe recovery rule must be a
superset of atomic step segments produced by the associated scheduling rule R.

Proof: Suppose that this claim is false and there exists a modular and safe recovery rule $\mathcal{R}$ which divides
an atomic step segment $\sigma_1$ of transaction $T_1$ into n commit segments with $n \ge 2$. Let the commit segments be
$\langle \sigma_{1,1}, \ldots, \sigma_{1,n} \rangle$. Let the set of data objects read or written by $\sigma_1$ be A. Let transaction $T_2$ consist of only one
atomic step segment $\sigma_2$, which writes into every data object in A. Now consider a schedule z for transaction
system $T = \{T_1, T_2\}$. Suppose that schedule z satisfies R and we execute T according to z, in which $T_2$ is
executed last. Suppose that a failure occurs just after $\sigma_{1,1}$ has been committed and commit segments
$\sigma_{1,2}, \ldots, \sigma_{1,n}$ are aborted. Let the remaining partial schedule $z_1$ be $\langle \sigma_2, \sigma_{1,2}, \ldots, \sigma_{1,n} \rangle$. Suppose that there are no more
failures. That is, z(1) is the committed schedule. Note that z(1) does not satisfy R. This is because in
$z(1) = \langle \sigma_{1,1}, \sigma_2, \sigma_{1,2}, \ldots, \sigma_{1,n} \rangle$, $\sigma_{1,1}$ precedes $\sigma_2$ but $\sigma_2$ also precedes $\sigma_{1,2}$ on A. That is, the atomic step segments of $T_1$
are not interleaved serializably with that of $T_2$. This contradicts the assumption that $\mathcal{R}$ is safe. □

6.2  The  Failure  Safe  Rule 

Having  developed  a  model  of  modular  recovery  rules,  we  now  define  our  failure  safe  rule  and  prove  that 
this  rule  is  modular,  consistent,  correct  and  safe.  We  conclude  this  section  by  showing  that  this  rule  is  optimal 
for  transaction  systems  using  generalized  setwise  serializable  schedules. 

When we study schedules, a useful concept is the serializability of interleaving the atomic step segments of
one transaction with those of other transactions. In the study of failure recovery, an important concept is the
interleaving of transaction ADS segments within the same transaction. When transaction ADS segments of a
given transaction are interleaved, they must be committed as a single atomic commit segment in order to
preserve the internal ordering of steps in a transaction.




Definition 6.10: Let $\Sigma_g(T_i)$ denote the set of transaction ADS segments resulting from a generalized setwise
scheduling rule R partitioning $T_i$. A transaction ADS segment $\sigma_k \in \Sigma_g(T_i)$ is said to be interleaved with
transaction ADS segment $\sigma_j \in \Sigma_g(T_i)$ if

$$\exists (t \in \sigma_k)\ (t^f_{\sigma_j} < t < t^l_{\sigma_j})$$

where $t^f_{\sigma_j}$ and $t^l_{\sigma_j}$ are the first and last steps in segment $\sigma_j$ respectively.


Definition 6.11: Given a transaction $T_i \in \mathbf{T}$ scheduled by a generalized setwise serializable scheduling rule
R, the failure safe rule $\mathcal{R}$ is a function that produces a partially ordered set of atomic commit segments as
follows:

1. For each transaction ADS segment $\sigma$ in an elementary transaction of $T_i$, name $\sigma$ as an atomic
commit segment if $\sigma$ is not interleaved with any other transaction ADS segment in this elementary
transaction.


2. When two or more transaction ADS segments in an elementary transaction are interleaved with
each other, name them as a single atomic commit segment.
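The two steps of the failure safe rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the report's implementation: an elementary transaction is represented (hypothetically) as a list giving the atomic data set accessed by each step in internal order, and interleaving is treated as transitive, so a chain of pairwise interleaved ADS segments is merged into one commit segment. The name `failure_safe_segments` is invented for the example.

```python
def failure_safe_segments(ads_of_step):
    """Group the ADS segments of one elementary transaction into commit segments.

    ads_of_step[i] names the atomic data set accessed by the i-th step,
    in the transaction's internal order.
    """
    first, last = {}, {}
    for pos, ads in enumerate(ads_of_step):
        first.setdefault(ads, pos)   # position of the first step of this ADS segment
        last[ads] = pos              # position of its last step

    commit_segments = []   # each entry is the list of ADS segments merged together
    current_end = -1       # position of the last step of the current commit segment
    for ads in sorted(first, key=first.get):
        if commit_segments and first[ads] < current_end:
            # The first step of this segment falls inside the span of the
            # current commit segment, so the segments are interleaved: merge.
            commit_segments[-1].append(ads)
            current_end = max(current_end, last[ads])
        else:
            # Not interleaved with anything seen so far: its own commit segment.
            commit_segments.append([ads])
            current_end = last[ads]
    return commit_segments
```

For example, the step sequence A, A, B, B yields two commit segments, while A, B, A, C merges A and B into one commit segment and leaves C on its own.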


Theorem 6.4: The failure safe rule is modular.

Proof:  It  directly  follows  from  Definitions  6.2  and  6.11.  □ 

Theorem 6.5: The failure safe rule $\mathcal{R}$ is safe. That is, under $\mathcal{R}$ we have

$$\forall (T \subseteq \mathbf{T})\ \forall (z \in Z_g(T))\ (z(n) \text{ is generalized setwise serializable},\ 0 \le n < \infty)$$

where $z(n) = \langle z'_1, \ldots, z'_n \rangle$ is a committed schedule for a given total of n failures.

Proof: By Theorem 6.1, to show that $z(n) = \langle z'_1, \ldots, z'_n \rangle$ is generalized setwise serializable for $0 \le n < \infty$,
we need to prove that the ordering of steps in z(n) is consistent with the internal ordering of steps in each of
the compound transactions. In addition, all the transaction ADS segments in an elementary transaction must
be interleaved serializably with those of others in z(n).


To prove that the internal ordering of transaction steps in a compound transaction is preserved, we need to
prove that the ordering of the elementary transactions of a compound transaction in z(n) is consistent with
that in the compound transaction and that the internal ordering of steps in each of the elementary transactions
is preserved. By Definition 6.11, the partial ordering of atomic commit segments is consistent with the partial
ordering of elementary transactions in a compound transaction. Since atomic commit segments are committed
in an order consistent with the partial ordering of atomic commit segments, the ordering of elementary
transactions of a compound transaction in z(n) is consistent with the ordering of them in the compound
transaction.

To complete the proof that the internal ordering of steps in a compound transaction is preserved, we need
to show that the ordering of steps in each of the elementary transactions is also preserved. Let $t_{i,j}$ and $t_{i,k}$ be
two steps in an elementary transaction $T^e_i$ and $t_{i,j} < t_{i,k}$. We need to show that $t_{i,j} < t_{i,k}$ in z(n). There are two
cases. First, suppose that $t_{i,j}$ and $t_{i,k}$ are in the same commit segment $\sigma$. Since $\sigma \subset z(n)$ and the commit
segment $\sigma$ is the superset that contains the transaction ADS segments in which $t_{i,j} < t_{i,k}$, it follows that $t_{i,j} < t_{i,k}$
in z(n). Second, suppose that $t_{i,j}$ is in commit segment $\sigma_1$ while $t_{i,k}$ is in $\sigma_2$ and $\sigma_1 < \sigma_2$. Since $\sigma_1 < \sigma_2$ in
z(n), it follows that $t_{i,j} < t_{i,k}$ in z(n).

We now prove that all the transaction ADS segments in an elementary transaction are interleaved serializably
with those of others in z(n). First, it follows from Assumption 6.2 that steps of elementary transactions
are interleaved setwise serializably in z and in the remaining partial schedules $z_i$, $1 \le i \le n$. Since each
committed partial schedule is contained in z or in one of the $z_i$, it follows that transaction ADS segments are
interleaved serializably with those of others in each of the committed partial schedules $z'_i$, $1 \le i \le n$. We now
claim that all the transaction ADS segments in an elementary transaction are interleaved serializably with those
of others in z(n). If we suppose that this claim is false, then there must exist at least two transaction ADS
segments in an elementary transaction such that these two segments are interleaved non-serializably. Let these
two ADS segments be $\sigma_i$ and $\sigma_j$. There are two cases. First, $\sigma_i$ and $\sigma_j$ are in the same committed partial
schedule. This contradicts the result that all the ADS segments of an elementary transaction are interleaved
serializably with those of others in the same committed partial schedule. Second, suppose that $\sigma_i$ and $\sigma_j$ are in
different committed partial schedules $z'_p$ and $z'_q$ and that $z'_p < z'_q$. By the assumption that these two transaction
ADS segments are non-serializable, there must exist some steps $t_{i,1}$, $t_{i,2}$ in $\sigma_i$ and some steps $t_{j,1}$, $t_{j,2}$ in $\sigma_j$ such
that "$t_{i,1} > t_{j,1}$" and "$t_{j,2} > t_{i,2}$". However, this contradicts the fact that all the steps in committed partial
schedule $z'_p$ precede those in $z'_q$. Hence, all the transaction ADS segments of an elementary transaction are
interleaved serializably with those of others in z(n). □




Corollary 6.5: The failure safe rule is consistent and correct.

Theorem 6.6: The failure safe rule $\mathcal{R}$ is the optimal failure recovery rule within the set of all the modular
and safe failure recovery rules for transaction systems using generalized setwise serializable scheduling rules.

Proof: To prove the optimality of the failure safe rule $\mathcal{R}$, we need to show that the set of commit segments
produced by $\mathcal{R}$ for any transaction is the most refined partition. There are two cases. First, a transaction ADS
segment of transaction $T_i$ could be taken as a commit segment by $\mathcal{R}$. By Theorem 6.3, this commit segment
cannot be further partitioned. In the second case, transaction ADS segments $\sigma_1, \ldots, \sigma_m$ in an elementary
transaction of $T_i$ are taken as a single commit segment by $\mathcal{R}$, because they are interleaved. Suppose that they
are interleaved and taken as a single atomic commit segment $\sigma$. Suppose that there exists a modular and safe
recovery rule $\mathcal{R}'$ which divides $\sigma$ into more than one atomic commit segment, $\sigma'_1, \ldots, \sigma'_n$, where $n \ge 2$. By
Theorem 6.3, each of these commit segments must contain an integer number of transaction ADS segments.
Therefore, $\sigma'_1, \ldots, \sigma'_n$ must be supersets of transaction ADS segments. Let the first failure occur just after
the commit of $\sigma'_1$ and let there be no more failures. Suppose that the ordering of steps of $T_i$ in z(1) is
consistent with the internal ordering of steps in $T_i$. This contradicts the assumption that the steps of the
transaction ADS segments in $\sigma$ are interleaved, because in z(1) all the steps of $\sigma'_1$, committed before the failure,
precede all the steps in $\sigma'_2, \ldots, \sigma'_n$ in $z_1$. It follows that the ordering of steps of $T_i$ in z(1) violates the internal
ordering of steps in $T_i$ and therefore z(1) is not generalized setwise serializable. This contradicts the assumption
that $\mathcal{R}'$ is safe. Thus, the commit segments produced by $\mathcal{R}$ for any transaction $T_i$ cannot be refined by other
modular and safe recovery rules. □

6.3  Summary 

In this chapter, we have shown that when the scheduling rule is the generalized setwise serializable scheduling
rule, the optimal modular and safe recovery rule is the failure safe rule. Given a compound transaction,
the failure safe rule examines the transaction ADS segments in each of the elementary transactions. If an ADS
segment is not interleaved with other transaction ADS segments in the same elementary transaction, then this
ADS segment is an atomic commit segment. Interleaved ADS segments in an elementary transaction must be
merged to form a single atomic commit segment.


From an application point of view, there are two important points one should be aware of. First, one
should write one's compound transaction carefully to avoid the interleaving of transaction ADS segments,
whenever this is possible. Unnecessarily interleaving the steps of transaction ADS segments in an elementary
transaction could lead to a serious loss of system concurrency. Second, one must realize that the spirit of the
failure safe rule is to provide "check-points", so that a transaction can resume its execution after being
interrupted by failures. As a matter of fact, once we have committed one single atomic commit segment of a
transaction, we can only abort commit segments not yet committed. We cannot abort this transaction as a whole.
Once we commit a commit segment we are obliged to complete the execution of our transaction. For
example, in the compound transaction Get-A-and-B introduced in the overview of this thesis, once we have
obtained one unit of a resource and committed the operation, we are obliged either to get the other unit or to
put back the unit which we have already taken.

We give a few informal suggestions regarding the implementation of the failure safe rule. The standard
distributed version of the two phase commit protocol^ [Bernstein 8.^] can be adopted to commit those atomic
commit segments. The major modification is that we also need to store the values of local variables of a
commit segment in the stable storage. Saving these values in the stable storage is important for resuming the
execution of a transaction that has been interrupted by a failure. To optimize the storage utilization, we can
save only those local variables that are shared by different commit segments. The identification of those
shared variables can be made easy if one is willing to request the explicit declaration of them as atomic
variables in the program of a compound transaction.
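The idea of saving shared local variables at each commit so that execution can resume can be sketched in Python. This is only an illustration under loud assumptions: a JSON file stands in for stable storage, an atomic file rename stands in for the commit point, and the two phase commit protocol itself is not modelled. The function name `run_with_checkpoints` and the file layout are hypothetical.

```python
import json
import os


def run_with_checkpoints(segments, path):
    """Run commit segments in order, saving shared variables after each commit.

    segments: list of callables, each taking the dict of shared variables.
    path: file used here as a stand-in for stable storage (an assumption).
    """
    state = {"next": 0, "shared": {}}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)          # resume after an earlier failure
    for i in range(state["next"], len(segments)):
        segments[i](state["shared"])      # execute the commit segment
        state["next"] = i + 1
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)             # atomic rename plays the role of the commit
    return state["shared"]
```

If a failure interrupts a segment, nothing is written for it, and a later call with the same `path` restarts from the first uncommitted segment with the shared variables of the last commit.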

Finally, we want to comment on the orphan problem of a nested transaction, because this is a frequently
discussed topic [Goree 83]. This problem occurs when a higher level nested transaction aborts but the lower
level ones are still running. For example, suppose that nested transaction T is running at the home node and
invokes a sub-transaction at a remote node. If transaction T is aborted by a failure at the home node then the
sub-transaction running at the remote node becomes an orphan. It is important to point out that this orphan
problem is an optimization issue, not a database consistency or transaction correctness issue. This is because a
sub-transaction of a nested transaction cannot commit its results to the database. When an orphan
sub-transaction eventually finishes its execution, it will try to pass its locks and results to its parent. If its parent
cannot be contacted by the database system, then the sub-transaction will be aborted by the recovery manager
of the database system. Since an orphan is destined to be aborted and it consumes resources, it is desirable to
track down orphans and abort them as quickly as possible.

^Some call it the three phase commit protocol.




Since our compound transactions are partitioned into a partially ordered set of commit segments by our
failure safe rule and each of the commit segments is committed in an order that is consistent with the partial
order of commit segments, we do not have the orphan problem at the level of managing commit segments of a
compound transaction. However, within a commit segment we can have the orphan problem because a
commit segment can be an elementary transaction in a nested form. The investigation of the orphan problem
is beyond the scope of this thesis and we suggest the use of algorithms in the database literature.




7.  An  Example  of  Application 

The two main aspects in the application of our theory are to partition the database into consistency
preserving atomic data sets and to write and schedule compound transactions. In this chapter, we give an
example to illustrate these two aspects. In Section 7.1, we give a brief review of our theory from an application
point of view. In Section 7.2, we address the partition of the database and in Section 7.3 we discuss the
writing of compound transactions. Readers who are interested in more detailed examples in the context of
distributed operating systems, and readers who are interested in the integration of our concurrency control
and failure recovery rules with the object oriented (abstract data type) approach, are recommended to read
Tokuda's, Clark's and Locke's work [Tokuda 85]. One of their main research efforts is to develop a new
computational model called Arobjects which integrates our theory with distributed abstract data types. A
brief review of this effort is in Section 8.2, future work.

7.1  A  Brief  Review  of  Our  Theory 

Conceptually, while atomic data sets result from a consistency preserving partition of the database, elementary
transactions result from a correctness preserving partition of a transaction. Given a database and its
associated consistency constraints, we partition the database into consistency preserving atomic data sets. As
long as the consistency of each atomic data set is satisfied individually, the consistency of the database is also
satisfied. Given a transaction and its associated post-conditions, we partition the transaction into a partially
ordered set of correctness preserving elementary transactions. As long as the post-condition of each elementary
transaction is individually satisfied, the post-conditions of the compound transaction are also satisfied.

Each elementary transaction must preserve the consistency of the database and satisfy its own post-condition
when it is executed serially and in an order consistent with the partial ordering of elementary
transactions in the compound transaction. Given a compound transaction, the generalized setwise serializable
scheduling rule partitions the steps of each elementary transaction into transaction ADS segments. A transaction
ADS segment is just all the steps in an elementary transaction that access the same atomic data set. This
means that as long as we execute all the elementary transactions setwise serializably and in an order consistent
with the partial orderings of elementary transactions defined by compound transactions, the concurrent
execution of compound transactions will be consistent and correct.
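As a small illustration of how transaction ADS segments arise, the following Python sketch groups the steps of one elementary transaction by the atomic data set they access. The representation is an assumption made for the example: a step is a hypothetical `(operation, data_object)` pair and `ads_of` is a hypothetical mapping from data objects to atomic data set names.

```python
from collections import defaultdict


def ads_segments(steps, ads_of):
    """Group the steps of one elementary transaction by accessed atomic data set."""
    segments = defaultdict(list)
    for step in steps:
        data_object = step[1]
        segments[ads_of[data_object]].append(step)   # keeps internal step order
    return dict(segments)


# Hypothetical elementary transaction touching two atomic data sets.
steps = [("read", "x"), ("write", "y"), ("write", "x")]
ads_of = {"x": "ADS1", "y": "ADS2"}
```

Here the two steps on `x` form one transaction ADS segment and the step on `y` forms another; because the step on `y` falls between the steps on `x`, the two segments are interleaved.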


To cope with system failures, the failure safe rule examines the transaction ADS segments of each elementary
transaction. If the steps of a transaction ADS segment are not interleaved with the steps of other
transaction ADS segments, then this segment is named as a commit segment. If the steps of several transaction
ADS segments are interleaved with each other, then they are merged into a single commit segment.
This procedure creates a partially ordered set of commit segments. As long as each compound transaction
commits its commit segments in an order consistent with the partial ordering of its commit segments, the
concurrency control will remain consistent and correct despite clean and soft system failures.
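Committing in an order consistent with a partial ordering amounts to choosing a linear extension of that ordering. A minimal Python sketch, assuming (hypothetically) that the partial order is given as a mapping from each commit segment to the set of segments that must commit before it, and using the standard library's `graphlib` (Python 3.9 or later):

```python
from graphlib import TopologicalSorter


def commit_order(predecessors):
    """Return one commit order consistent with the partial ordering.

    predecessors maps each commit segment to the set of commit segments
    that must be committed before it.
    """
    return list(TopologicalSorter(predecessors).static_order())


def respects(order, predecessors):
    """Check that every segment appears after all of its predecessors."""
    position = {segment: i for i, segment in enumerate(order)}
    return all(position[p] < position[s]
               for s, before in predecessors.items() for p in before)
```

The segment names used with these functions are placeholders; any order accepted by `respects` is an admissible commit order in the sense of the paragraph above.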

In order to avoid cascaded aborts, the writelocks acquired by a commit segment cannot be released until it
has successfully committed. Thus, for all practical purposes commit segments are the atomic units for both
concurrency control and failure recovery in a compound transaction. As a style of programming, we prefer to
make elementary transactions correspond to commit segments, our atomic units for both concurrency control
and failure recovery. That is, we would like to write compound transactions in such a way that each elementary
transaction contains a single commit segment. This tends to simplify the implementation of the concurrency
control and failure recovery of compound transactions, because we can use the scope of elementary
transactions as delimiters for the runtime management of concurrency control and failure recovery.

7.2  Partitioning  The  Database 

From  an  application  point  of  view,  to  partition  the  database  into  consistency  preserving  atomic  data  sets  is 
straightforward  if  the  consistency  constraints  are  given.  However,  the  specification  of  database  consistency 
constraints needs a good understanding of the applications in question.

In  order  to  maximize  the  benefit  of  parallelism  in  a  distributed  system,  it  is  often  useful  to  consider  the  use 
of  consistency  constraints  that  are  weaker  than  the  corresponding  ones  in  centralized  systems.  We  illustrate 
this  idea  by  a  simplified  case  of  managing  the  directories  of  a  file  system  as  follows.  We  consider  a  set  of 
shared  system  files  distributed  at  different  nodes.  There  is  a  local  directory  (LD)  at  each  node  indicating  the 
resident  files.  With  only  these  local  directories,  one  must  potentially  search  through  all  the  LD's  in  order  to 
locate  a  file,  and  this  would  be  very  inefficient.  To  increase  efficiency,  the  system  has  a  global  directory  (GD) 
at  some  node.  The  GD  indicates  which  LD  should  be  searched  for  each  of  the  shared  files.  The  GD  is 
replicated  for  reliability  and  performance.  When  one  needs  a  file,  the  local  operating  system  kernel  will  first 
search through its LD, and then it will search a nearby GD if the file is not in its LD. The introduction of




GD but do not need to use concurrency control protocols to make the transfer of the file and the updating of
the LD and GD appear to be an instantaneous event with respect to other transactions. On the other hand, if
we want to enforce the consistency constraint "each of the file entries in a GD always points to the current
location of the file", then a transaction must use concurrency control protocols to ensure that the transfer of
the file and the updating of the GD and the LD appear to be an instantaneous event.

From an application point of view, although all the consistent states are equivalent in the sense that none of
them will cause transactions to execute incorrectly, certain consistent states are more favorable than others.
Performance enhancement schemes are designed to increase the probability of the database staying in the
favorable states. For example, suppose that we replace the strong consistency constraint "each of the file
entries in a GD always points to the current location of the file" with the weak one "each of the file entries of
a GD can point to either the current location or a historical location of the file". In this case, the states in
which file entries of a GD point to the old locations are consistent but unfavorable states. The performance
enhancement scheme "update the GD whenever the location of a file changes" will only improve the chance
that transactions will see the consistent and favorable state "each of the file entries in a GD points to the
current location of the file." However, this is not a guarantee. It is possible that when transaction $T_1$ transfers
file F from node 1 to node 3 and updates the LD at node 1, another transaction $T_2$ is reading the GD at node
2, which indicates that file F is in node 1. When transaction $T_1$ is updating the GD at node 2, $T_2$ searches the
LD at node 1 to locate file F and only finds that file F is no longer in node 1.

Generally speaking, weaker consistency constraints permit a higher degree of concurrency. However, once
the consistency constraints are weakened, the complexity of transactions will be increased for two reasons.
First, the process of weakening the consistency constraints enlarges the number of system states that are
considered to be consistent. For example, if the set of consistency constraints regarding the GD and PGD is
relaxed, a transaction must be written to function correctly in the case that the GD or PGD will only give a
valid LD location but not necessarily the LD location where the file actually resides. That is, a transaction
must be able to abort when the file cannot be found or traced. Transactions must have the ability to deal with
all the possible system states that are consistent. Second, strong consistency constraints generally ensure that
the system will stay in a small set of favorable states, although enforcement could be too expensive. When the
set of permissible states is enlarged, the post-conditions of transactions must be redesigned to better keep the
system in favorable states. This also increases the complexity of transactions. The evaluation of the trade-offs
between system concurrency and transaction complexity is an exciting new research area. However, the
analysis of the performance trade-off is outside the scope of this thesis. As a rule of thumb, however, it is
usually worthwhile to have weaker consistency constraints if the resulting increase of complexity in transactions
is manageable, such as it is in the example.
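The extra burden that a weakened constraint places on transactions can be illustrated with a short Python sketch. It assumes a hypothetical data layout, in which the GD is a dictionary from file names to node identifiers and each LD is a set of resident file names; the names `locate` and `FileNotLocated` are invented for the example.

```python
class FileNotLocated(Exception):
    """Raised so that the calling transaction can abort."""


def locate(file_name, gd, local_dirs):
    """Follow a GD entry and verify it against the LD that it names.

    Under the weak constraint the GD entry may be historical, so the
    transaction must check the LD and be prepared to abort.
    """
    node = gd.get(file_name)
    if node is not None and file_name in local_dirs.get(node, set()):
        return node
    raise FileNotLocated(file_name)
```

With a stale entry such as `gd = {"F": 1}` after F has moved to node 3, the lookup raises `FileNotLocated` and the transaction aborts, which is the situation the weak constraint obliges transactions to handle.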

7.3  Writing  Compound  Transactions 

From a programmer's point of view, under serializability theory the entire set of steps of a transaction is a
single unit of atomic action, that is, it is a single commit segment. Under our theory, a compound transaction
is a partially ordered set of atomic actions in the form of atomic commit segments. When an atomic action is
committed, it leaves the database consistent. Therefore, the results of its computation can be seen by other
transactions and the locks held by it can be released.

In order to improve system concurrency, we should try to have many small atomic actions rather than a few
big ones in the writing of a compound transaction. Since the consistency of each atomic data set can be
preserved independently of others, one way to have small atomic actions is to try grouping the operations on
each of the atomic data sets into elementary transactions. There are three guiding principles in doing so.

1.  Each  of  these  elementary  transactions  must  preserve  the  consistency  of  the  accessed  atomic  data 
sets  so  that  other  transactions  will  not  see  inconsistent  states. 

2.  The  compound  transaction  must  have  a  method  to  handle  the  consequence  of  releasing  the  locks 
by  its  elementary  transactions.  This  is  because  when  the  locks  on  an  atomic  data  set  are  released, 
the  state  of  this  atomic  data  set  is  likely  to  be  changed  by  other  transactions. 

3. Committing an elementary transaction is as serious as committing an entire transaction under a
serializable schedule. We cannot abort a committed action, whether it is an entire transaction or an
elementary transaction. Once we have committed an elementary transaction of a compound transaction,
we are obliged to complete the execution of the compound transaction. For example, in the
compound transaction Get-A-and-B discussed in the overview, if we have obtained one of the
resources and committed the elementary transaction, then we are obliged to complete the
execution of Get-A-and-B by either putting back the obtained resource or getting another one.
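
The third principle can be illustrated with a minimal Python sketch of Get-A-and-B. The names `Pool`, `take`, `put_back` and `get_a_and_b` are hypothetical and do not come from the report. Each pool stands for one atomic data set, and each method is assumed to behave like an elementary transaction that commits and releases its lock on return.

```python
import threading


class Pool:
    """One atomic data set: a count of free units guarded by its own lock."""

    def __init__(self, free):
        self.free = free
        self.lock = threading.Lock()

    def take(self):
        # Elementary transaction: lock, update, commit, release.
        with self.lock:
            if self.free == 0:
                return False
            self.free -= 1
            return True

    def put_back(self):
        # Compensating elementary transaction.
        with self.lock:
            self.free += 1


def get_a_and_b(pool_a, pool_b):
    """Compound transaction; returns True if both resources were obtained."""
    if not pool_a.take():
        return False  # nothing has been committed yet, so we may simply stop
    if pool_b.take():
        return True
    # The first elementary transaction has committed and cannot be aborted.
    # We complete the compound transaction by putting the resource back.
    pool_a.put_back()
    return False
```

In this sketch the compound transaction never undoes a committed step. It always moves forward, either by acquiring the second resource or by running a further elementary transaction that returns the first one.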

We  now  illustrate  the  construction  of  a  compound  transaction  by  a  file  transfer  transaction  example. 

Suppose that in the directory example presented in the previous section we have two global directories,
GD1 and GD2, a set of local directories, LD1, ..., LDn, and a set of partial global directories, PGD1, ...,
PGDn. According to the discussion in the previous section, each of them is modelled as an atomic data set.
In this example, we assume that each PGD and each GD is a list of <file name, name of resident node> pairs.
However, each LD has a tree structure. For simplicity, we assume that each tree has only two levels:
the root and the leaves. The root is a list of <file name, file descriptor pointer> pairs. Each of the leaves
represents a file descriptor, which is a record of various attributes such as logical name, system name, the
storage area, the capability attribute and the file state attribute. The capability attribute describes the current
state of user capabilities, for example, that the file is read-only for all users except the one who holds a special key.
The file state attribute describes who holds a readlock or writelock on the file. For simplicity, we also assume
that all the system files have unique logical names given by some system name server. Figure 7-2 shows the
tree structure of a local directory.


                         Root of LD

File Descriptor 1    File Descriptor 2    File Descriptor 3


Figure 7-2: A Local Directory
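
As a rough illustration of the directory structures just described, the following Python sketch models a two-level local directory and a flat global (or partial global) directory. All class, field and function names are assumptions made for this example. In particular, the capability attribute is simplified to a set of operations open to users who do not hold the special key.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class FileDescriptor:
    """Leaf of a local directory tree (attribute names are illustrative)."""
    logical_name: str
    system_name: str
    storage_area: str
    # Capability attribute, simplified to the operations open to other users.
    capabilities: Set[str] = field(
        default_factory=lambda: {"read", "write", "delete", "transfer"})
    # File state attribute: who holds a readlock or a writelock.
    read_lockers: Set[str] = field(default_factory=set)
    write_locker: Optional[str] = None


@dataclass
class LocalDirectory:
    """Two-level tree: the root maps file names to descriptors."""
    root: Dict[str, FileDescriptor] = field(default_factory=dict)


# A GD or PGD is modelled as a flat mapping: file name -> resident node name.
GlobalDirectory = Dict[str, str]


def resident_node(directory: GlobalDirectory, file_name: str) -> Optional[str]:
    """Return the node recorded for file_name, or None if it is unknown."""
    return directory.get(file_name)
```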


The tree structure of an LD suggests using the tree lock protocol [Silberschatz 80]. To enforce
setwise serializability in the execution of elementary transactions, we must observe the following protocol on
each of the tree structured atomic data sets:

•  except  for  the  first  data  object  locked,  no  data  object  in  a  tree  can  be  locked  unless  a  lock  is 
currently  held  on  its  parent. 

•  no data object is ever locked twice, that is, a data object cannot be unlocked and then locked
again later by the same elementary transaction.
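
The two rules above can be checked mechanically. The sketch below is a hypothetical bookkeeping class, not part of the report, that tracks the locks one elementary transaction takes on one tree and rejects any request that breaks either rule. The tree is assumed to be given as a child-to-parent mapping.

```python
class TreeLockViolation(Exception):
    """Raised when a lock request breaks the tree lock protocol."""


class TreeLockSession:
    """Locks taken by one elementary transaction on one tree."""

    def __init__(self, parent_of):
        self.parent_of = parent_of   # node -> parent; the root maps to None
        self.held = set()            # locks currently held
        self.ever_locked = set()     # every node locked so far

    def lock(self, node):
        if node in self.ever_locked:
            raise TreeLockViolation(f"{node} has already been locked once")
        if self.ever_locked:
            # Not the first lock: the parent must be locked right now.
            parent = self.parent_of.get(node)
            if parent not in self.held:
                raise TreeLockViolation(f"parent of {node} is not locked")
        self.held.add(node)
        self.ever_locked.add(node)

    def unlock(self, node):
        self.held.remove(node)
```

For a local directory, a session would typically lock the root first, then one descriptor, and only then release the root. After that release no further descriptor can be locked in the same elementary transaction.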

To avoid cascaded aborts, we have the additional requirement that each elementary transaction keeps all the
writelocks until it has committed. For unstructured atomic data sets we will use the setwise two phase lock
protocol. This protocol requires an elementary transaction not to release any lock until it has acquired all the
locks in an atomic data set. To avoid cascaded aborts, we will keep all the writelocks acquired in an atomic
commit segment until the atomic commit segment has successfully committed. Note that an atomic commit
segment can include several interleaved transaction ADS segments. This means that as far as readlocks are
concerned, we can release them by following the setwise two phase lock protocol. But the release of writelocks
will be further delayed until the atomic commit segment has successfully committed.
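
The following Python sketch shows one way to read these locking rules for unstructured atomic data sets. It is only lock bookkeeping for a single atomic commit segment, and the class and method names are assumptions. The point it tries to make is that the two-phase rule is applied per atomic data set rather than across the whole transaction, while writelocks are kept until commit.

```python
class SetwiseTwoPhase:
    """Lock bookkeeping for one atomic commit segment (illustrative only)."""

    def __init__(self):
        self.held = {}           # (data_set, item) -> "read" or "write"
        self.shrinking = set()   # data sets in which a lock has been released

    def acquire(self, data_set, item, mode):
        if data_set in self.shrinking:
            raise RuntimeError("no new locks in a data set after a release in it")
        self.held[(data_set, item)] = mode

    def release_read(self, data_set, item):
        if self.held.get((data_set, item)) != "read":
            raise RuntimeError("only held readlocks may be released before commit")
        del self.held[(data_set, item)]
        self.shrinking.add(data_set)

    def commit(self):
        """Release everything still held, writelocks included."""
        released = sorted(self.held)
        self.held.clear()
        return released
```

Releasing a readlock in one atomic data set does not prevent the segment from acquiring locks in a different atomic data set, which is where the extra concurrency over ordinary two phase locking comes from.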

A file transfer transaction will be given the file name, the capability key, and the names of the source node and
the destination node. If this transaction cannot find the file in the source node, it will report to the user that
his file is not in the source node and quit. In addition, if the transaction finds out that the user's capability key
does not allow him to move this particular file, he will also be informed accordingly. Otherwise, it will
transfer the file from the source node to the destination node, leave a forwarding address at the PGD of the
source node and update the source LD, destination LD, GD1 and GD2.

Before designing the compound transaction, we would like first to give a more detailed description of this
transaction. The first atomic data set accessed by this compound transaction is the source LD. We begin with
an elementary transaction Check-Out, which first readlocks the root of the source LD and checks if the file is in
this node. If it cannot find the file name, it will report to the user that his file is not in the source node and
then quit. If it finds the file, then it writelocks the descriptor of this file and checks if the user has the
capability to move this file. If the user does not have the required capability, he will be informed accordingly
and our compound transaction stops here. If the user has the capability to do so, then Check-Out first copies
the file descriptor to a private area in the stable storage and then updates the capability attribute of this file
descriptor to disable the capabilities of write, delete and transfer for all other transactions. However, users
who have the capability to read this file can still read it. After we have copied the file from the source node to
the destination node and updated the source PGD, GD1 and GD2, we will wait for those read transactions
to finish their reading and then delete the source file and update the source LD. This arrangement will
provide an uninterrupted service to users of the system files.

Having committed elementary transaction Check-Out, we start an elementary transaction Copy-File-to-Buffer,
which copies both the saved original file descriptor and the file into a buffer in the source node.⁴
Having copied the file into the source buffer, we start another elementary transaction Send, which sends the
content of the source buffer to the buffer at the destination node. After sending the file and its descriptor to a
buffer in the destination node, we start a new elementary transaction Copy-File-from-Buffer, which writelocks
the root of the destination LD, copies the file from the buffer to the disk, updates the list of <file name, file
descriptor pointer> pairs at the root and creates the corresponding file descriptor at the destination node for
the transferred file. After the file is copied over, we start three elementary transactions in parallel. One is to
leave a forwarding address in the PGD of the source node, one is to update GD1 and the last one is to update
GD2. After the source PGD, GD1 and GD2 are updated, we are ready to delete the file at the source
node. We start a new elementary transaction Delete to delete the file, the file descriptor and the corresponding
<file name, file descriptor pointer> pair at the root of the source LD.

In  the  following,  we  first  outline  the  structure  of  compound  transaction  Transfer.  Next,  we  comment  on 
the implementation of recovery management. Finally, we give the pseudo-code of compound transaction
Transfer. 


CompoundTransaction Transfer
AtomicVariable continue : Boolean;
SourceBuffer, DestinationBuffer: array[1..size] of char;

BeginSerial

  ElementaryTransaction Check-Out;
  {This transaction sets the variable continue to
   either true or false according to user's capability.}

⁴ For simplicity, we assume that we have large buffers, each of which can hold an entire file.




  if not continue then goto Last-Step;

  ElementaryTransaction Copy-File-to-Buffer;
  ElementaryTransaction Send;
  ElementaryTransaction Copy-File-from-Buffer;

  BeginParallel
    ElementaryTransaction Update-PGD;
    ElementaryTransaction Update-GD1;
    ElementaryTransaction Update-GD2;
  EndParallel;

  ElementaryTransaction Delete;

  Last-Step: Commit and release all the locks;
EndSerial. {end of compound transaction Transfer}


Having  outlined  the  structure  of  compound  transaction  Transfer,  we  now  comment  on  the  recovery 
management  of  this  transaction. 


CompoundTransaction Transfer
AtomicVariable continue : Boolean;
SourceBuffer, DestinationBuffer: array[1..size] of char;

BeginSerial

  ElementaryTransaction Check-Out
  {Checks the user capability. If o.k. then updates the capability
   attribute of the file descriptor and sets continue to true.}
  {When Check-Out commits, the value of continue is saved at disk.
   In addition, the recovery manager check-points that the first
   commit segment of Transfer has been committed. If a failure occurs
   after the commit of Check-Out and before the second commit, the
   execution of Transfer will be re-started at the first step after
   Check-Out and the values of atomic variables are restored from disk.
   We must emphasize that the commit of the results of an elementary
   transaction, the save of the values of atomic variables and the
   check-point of the execution of the compound transaction must
   be done atomically: either all three are done or none is done.}




  if not continue then goto Last-Step;

  ElementaryTransaction Copy-File-to-Buffer
  {Copies the file and its descriptor to the source node buffer.}
  {Commit point and execution check-point.}

  ElementaryTransaction Send
  {Sends the content of the source buffer to the destination buffer.}
  {Commit point and execution check-point.}

  ElementaryTransaction Copy-File-from-Buffer
  {Copies the file from destination buffer to disk and updates the LD.}
  {Commit point and execution check-point.}

  BeginParallel
    ElementaryTransaction Update-PGD;
    {Commit point and execution check-point.}

    ElementaryTransaction Update-GD1;
    {Commit point and execution check-point.}

    ElementaryTransaction Update-GD2;
    {Commit point and execution check-point.}
  EndParallel;
  {These three elementary transactions will be executed, committed and
   check-pointed in parallel. For example, if only one is successfully
   committed and check-pointed but the other two are aborted by a
   failure, then the aborted two will be restarted in parallel.}

  ElementaryTransaction Delete;
  {Deletes the file at source node and updates LD.}
  {Commit point and execution check-point.}

  Last-Step: Commit and release all the locks;
EndSerial. {end of compound transaction Transfer}
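
The restart behaviour described in the comments above can be sketched in Python. This is a simplified model under stated assumptions: the helper names `save_checkpoint`, `load_checkpoint` and `run_compound` are hypothetical, each step is an ordinary function standing for an elementary transaction, and a single JSON file written with an atomic rename stands in for the stable storage that records the committed step and the atomic variables together. A real system would also have to make the database commit itself part of the same atomic action.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, atomic_vars):
    """Atomically record the number of committed steps and the atomic variables."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "vars": atomic_vars}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # the rename makes the new record visible all at once


def load_checkpoint(path):
    """Return the last record, or a fresh one if nothing was committed."""
    if not os.path.exists(path):
        return {"step": 0, "vars": {}}
    with open(path) as f:
        return json.load(f)


def run_compound(steps, path):
    """Run steps in order, resuming after the last check-pointed step."""
    state = load_checkpoint(path)
    for i in range(state["step"], len(steps)):
        steps[i](state["vars"])                      # elementary transaction body
        save_checkpoint(path, i + 1, state["vars"])  # commit point and check-point
    return state["vars"]
```

If a step raises an exception, nothing is recorded for it, so a later call to `run_compound` with the same steps and path starts again at that step with the atomic variables restored from the file.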


Having discussed the structure of compound transaction Transfer, we now list the pseudo-code of this
transaction.




CompoundTransaction Transfer(file-name, capability-key,
                             source-node, destination-node);
AtomicVariable continue : Boolean;
SourceBuffer, DestinationBuffer: array[1..size] of char;

BeginSerial

  ElementaryTransaction Check-Out(file-name, capability-key);
  BeginSerial
    Readlock the root of source LD;
    if file is not here
    then BeginSerial
           Report "file is not in source node";
           continue := false;
         EndSerial
    else continue := true;

    if continue
    then BeginSerial
           Writelock the file descriptor;
           Check if the user has the capability to move the file;
           if he has the capability
           then BeginSerial
                  Copy the descriptor to a private area in disk;
                  Update the capability attribute to disable the delete,
                  write and transfer capabilities of other transactions;
                EndSerial
           else BeginSerial
                  continue := false;
                  Report to user that he cannot move the file;
                EndSerial;
         EndSerial;

    Commit and release all the locks;
    {The atomic variable continue is saved at stable
     storage as part of the commit operation.}
  EndSerial; {end of Check-Out}

  if not continue then goto Last-Step;




  ElementaryTransaction Copy-File-to-Buffer(file-name, buffer-name);
  BeginSerial
    Readlock the root of source LD;
    Readlock the file descriptor;
    Copy the file descriptor
      saved in a private area of the stable storage into the buffer;
    Readlock the file;
    Copy the file into the buffer;
    Commit and release all the locks;
  EndSerial; {end of Copy-File-to-Buffer}

  ElementaryTransaction Send(source-buffer-name,
                             destination-buffer-name);
  BeginSerial
    Copy the content of the source buffer to the destination buffer;
    Commit;
  EndSerial; {end of Send}
  {Buffers are private storage areas; no lock is needed.}

  ElementaryTransaction Copy-File-from-Buffer(file-name,
                                              destination-buffer-name);
  BeginSerial
    Writelock the root of destination LD;
    Copy the file from the buffer to disk;
    Create a new file descriptor;
    Copy all the attributes from the original one
      except storage area and file state attributes;
    In the storage area attribute, record the new storage area;
    In the file state attribute,
      record that currently there is no lock on the file;
    Insert <file name, pointer to file descriptor> pair to the root;
    Commit and release all the locks;
  EndSerial; {end of Copy-File-from-Buffer}

  BeginParallel
    ElementaryTransaction Update-PGD(file-name, source-node);
    BeginSerial
      Writelock the source PGD;
      Leave the new address of the file in PGD;
      Commit and release all the locks;
    EndSerial; {end of Update-PGD}




    ElementaryTransaction Update-GD(file-name, GD1);
    BeginSerial
      Writelock GD1;
      Update the transferred file's address;
      Commit and release all the locks;
    EndSerial; {end of Update-GD}

    ElementaryTransaction Update-GD(file-name, GD2);
    BeginSerial
      Writelock GD2 file name entry;
      Update the transferred file's address;
      Commit and release all the locks;
    EndSerial; {end of Update-GD}
  EndParallel;

  ElementaryTransaction Delete(file-name, source-node);
  BeginSerial
    Writelock the root of source LD;
    Writelock the file descriptor;
    Writelock the file;
    Delete the file;
    Delete the file descriptor;
    Delete the <name, file descriptor pointer> pair at the root;
    Commit and release all the locks;
  EndSerial; {end of Delete}

  Last-Step: Commit and release all the locks;
  {In this compound transaction, this step is only a formalism in
   the syntax. All the locks have been released by the elementary
   transactions.}

EndSerial. {end of Transfer}
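
The parallel block of Transfer restarts only those elementary transactions that did not commit. A minimal Python sketch of that behaviour follows. The function name `run_parallel_until_committed` and the `max_rounds` limit are assumptions made for the example, each elementary transaction is modelled as a callable that raises an exception when it is aborted by a failure, and threads stand in for whatever parallel execution the system provides.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_until_committed(transactions, max_rounds=5):
    """Run the callables in parallel; rerun only those that raised.

    transactions maps a name to a zero-argument callable.
    Returns (names that committed, sorted names still pending).
    """
    pending = dict(transactions)
    committed = []
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=len(pending)) as pool:
            futures = {name: pool.submit(fn) for name, fn in pending.items()}
        # Leaving the with-block waits for every submitted call to finish.
        for name, future in futures.items():
            if future.exception() is None:
                committed.append(name)   # commit point and check-point
                del pending[name]
    return committed, sorted(pending)
```

With three callables for Update-PGD and the two Update-GD transactions, a single failed update is retried on the next round while the two that committed are left alone.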


Since  we  divide  compound  transaction  Transfer  into  many  elementary  transactions  each  of  which  commits 
and  releases  its  acquired  locks,  we  obtain  a  very  good  degree  of  concurrency.  We  want,  however,  to  make  two 
comments on the first elementary transaction Check-Out. First, the manipulation of the capability attribute
by  Check-Out  is  equivalent  to  having  atomic  readlocks  at  the  file  and  at  the  file  descriptor.  An  atomic 
readlock  is  a  readlock  recorded  in  the  stable  storage  and  therefore  will  not  be  affected  by  failures.  Instead  of 
manipulating  the  capability  attribute,  one  can  use  atomic  readlocks  if  one's  system  supports  them.  Although 
syntactically,  each  of  our  elementary  transactions,  including  Check-Out,  commits  and  releases  all  its  acquired 
locks,  Check-Out,  in  effect,  passes  the  "atomic  readlocks"  to  compound  transaction  Transfer  rather  than 
releasing  all  the  locks  after  it  has  committed. 


We have mentioned that when an elementary transaction commits, it could release all the locks without
adversely affecting other transactions because it leaves the database in a consistent state. However, releasing
all the locks will allow other transactions to modify the atomic data sets in an unrestricted way, such as
deleting the file, and our compound transaction must devise a way to cope with it. Depending upon the
application in question, sometimes it is easy to devise such a way and sometimes it is not. In this particular
instance, releasing all the locks provides a high degree of concurrency but also makes the compound
transaction Transfer very complex.

The easiest way to reduce the complexity of compound transaction Transfer is to extend the scope of
Check-Out. That is, we make the first elementary transaction Check-Out-and-Transfer. Since we now
hold the writelocks on both the source LD and destination LD within a single elementary transaction, the
problem mentioned above cannot happen. Although this solution results in a simple compound transaction,
the concurrency is low, because the transaction holds writelocks on both source and destination LDs while
transferring a file across a network.

A reasonable trade-off between system concurrency and transaction complexity is to let Check-Out
modify the capability attribute or use the atomic readlocks as described before. Generally speaking, having
committed an elementary transaction we have two options. One is to release all the locks and deal with the
consequences later. Another is to use atomic readlocks so that the resulting computation is visible to but not
changeable by other transactions. This is a compromise between the approaches of "ending the elementary
transaction and releasing all the locks" and of "extending the scope of the elementary transaction and holding
on to all the locks". In this instance, we consider this compromise approach a good choice for our problem at
hand. Recall that in the previous section we said that the trade-off between system concurrency and
transaction complexity is an important consideration in the design of atomic data sets. We now see that it is
also an important consideration in the design of a compound transaction.




8.  Conclusion 

8.1  Summary 

We have developed a formal theory of modular scheduling and recovery rules. This theory coherently
addresses the notions of consistency, correctness, modularity and optimality in concurrency control and
failure recovery. Furthermore, this theory provides us with provably consistent, correct and optimal modular
concurrency control and failure recovery rules.

The four main results in this theory are a formal model of modular scheduling rules, the setwise serializable
scheduling rule, the compound transactions and their associated generalized setwise serializable scheduling
rules, and the failure safe rule. The formal model of modular scheduling rules permits us to have a unified
treatment of different modular scheduling methods. The setwise serializable scheduling rule is the optimal
application independent scheduling rule. The generalized setwise serializable scheduling rules form a
complete class within the set of all the modular scheduling rules. This means that for any given modular
scheduling rule, there always exists a generalized setwise serializable scheduling rule that provides at least as much
concurrency. The failure safe rule is the optimal failure recovery rule for any given set of compound
transactions scheduled by the generalized setwise serializable scheduling rules. Finally, our theory includes
nested transactions, serializability theory and failure atomicity as special cases.

8.2  Future  Work 

Our  work  on  modular  concurrency  control  and  failure  recovery  rules  only  addresses  one  of  the  many 
inter-related  topics  in  the  study  of  transaction  facilities  of  a  distributed  computer  system.  There  are  many 
related  topics  that  need  to  be  investigated.  For  example,  the  performance  evaluation  of  the  trade-offs 
between  system  concurrency  and  transaction  complexity  mentioned  in  Chapter  7  is  a  new  area  of  research. 
We believe that the proper specification of consistency constraints of a distributed computer system is of
fundamental importance. When the performance of a distributed computer system becomes a problem,
many practitioners will use some form of ad hoc "weak" consistency constraints to allow for a higher degree of
system  concurrency.  It  would  be  helpful  to  have  a  formal  theory  to  guide  the  trade-offs  between  system 
concurrency  and  transaction  complexity. 


Another interesting area of research is the concept of co-operating transactions [Sha 83]. Co-operating
transactions are a model of a team of transactions which co-ordinate their actions without a
common top level transaction. As an analogy, co-operating transactions are like co-routines rather than
sub-routines of a higher level program. Currently, concurrency control theories do not allow transactions to
exchange messages and co-operate. From a computational point of view, these transactions are merely
processes with embedded concurrency control and failure recovery mechanisms. Co-operating
(communicating) processes play an important role in modern operating systems, but unfortunately we do not
have a theory of co-operating transactions. In a previous paper [Sha 83], we reported some experiments with
the idea of co-operating transactions: a set of transactions with a set of consistency constraints defined upon
their state variables. These consistency constraints represent the rule of co-operation. However, the
formalization of these ideas still has a long way to go.

The direct extensions of our work are those that are directly related to the development and implementation
of transaction facilities. To implement a transaction facility that supports the application of our theory,
we need to develop programming languages and their compilers or interpreters that support the writing of
compound transactions, and a distributed database system or a distributed operating system that provides runtime
support for the implementation of atomic data sets and the execution of compound transactions. To use our
theory for the construction of an operating system rather than database applications, we need to integrate our
concurrency control and failure recovery rules with the software engineering approach of abstract data types.

Although this future work is beyond the scope of this thesis, we are glad to report that progress has already
been made. Currently, in the Computer Science Department of Carnegie-Mellon University, Tokuda, Clark
and Locke are developing a decentralized operating system ArchOS [Tokuda 85], which is the operating
system of the Archons decentralized computer system project [Jensen 83, Jensen 84]. An important feature of
ArchOS is that it supports transactions at the kernel level. In fact, all the ArchOS system primitives employ
transactions. Their research effort in the area of transaction facility includes:

•  The  development  of  a  new  model  of  computation  called  Arobject.  This  model  integrates  our 
theory  with  the  distributed  abstract  data  type  approach. 

•  The  formulation  of  an  alternative  syntax  of  compound  transactions,  which  is  more  suitable  for  the 
implementation  of  operating  system  objects. 

•  The development of various mechanisms that carry out our concurrency control and failure
recovery rules.

•  The design and implementation of a prototype system.

Their computation model integrates our theory with the distributed abstract data type approach. In this thesis,
a database is modelled as a set of shared data objects, which is not constructed as layered and encapsulated
objects in the form of abstract data types. To apply our theory to the construction of operating system objects,
we need to address the problem of integrating existing lower level operating system objects into a higher level
one. Although currently Arobject is still in a developmental stage, we would like to describe some of the basic
ideas for solving the abstraction problem. These ideas are important for the application of our theory in the
context of operating systems.

In essence, the idea is to divide an Arobject into well defined levels (layers) of abstraction and then apply
our theory to each of the levels for concurrency control and failure recovery management. An Arobject can
consist of distributed data objects and their associated operations with many levels of abstraction. At level
zero of abstraction, the state space of an Arobject is the Cartesian product of the domains of data objects at
level zero. An Arobject designer chooses a set of consistency constraints that defines a subset of the state space
called the consistent states at level zero. A set of consistent and correct transactions is then written to
represent the operations at level zero. Our theory is then used for the concurrency control and failure
recovery of this set of transactions.

Having developed the level zero Arobject, the set of consistent states at level zero is then partitioned into
equivalence classes, each of which corresponds to an abstract state visible at level one. Viewed at level one,
each of the transactions that can be called at level one represents a non-divisible primitive step (operation) at
level one, which transforms a level one state to another level one state. At level one, an Arobject designer can
define new consistency constraints upon the abstract state space at level one and write transactions composed
of the primitive operations provided by level zero. A level one designer need not worry about how to
maintain the consistency constraints at level zero. These constraints are automatically maintained by the
primitive operations at level one. Our theory can then be applied again at level one for the purpose of
concurrency control and failure recovery of transactions written at level one. This exercise can be repeated
again and again to produce higher levels of abstraction.

As an example of this approach, suppose that Disk-1 is an Arobject at level zero. At this level, Disk-1
controls the disk driver, formats the disk and keeps track of the physical address of each block and its
state: empty or allocated to user i. Suppose that there is a total of 100,000 blocks of storage at disk one. An
abstract state provided by level zero to level one is in the form of <number-of-empty-blocks,
number-of-blocks-used-by-user-i, 0 < i < maximal-number-of-users> with the associated consistency constraint "the sum of
empty blocks and allocated blocks equals 100,000". A level one state describes the number of empty blocks
and how many blocks are used by each of the users, but ignores detailed information such as the physical
locations of user i's blocks. The primitive operations provided by level zero to level one are transactions
Allocate-Blocks and Deallocate-Blocks. Note that the consistency constraints at level zero, such as those
associated with storage locations (map) and formatting, are not visible at level one but will be enforced by
transactions Allocate-Blocks and Deallocate-Blocks. As far as a level one designer is concerned, these two
transactions play the role of primitive atomic steps at his level. Each of these two steps maps a level one state to
another level one state. A level one state describes the number of empty blocks and the number of blocks
allocated to each of the users. The fact that these two steps are actually complex transactions which control
the disk driver and maintain many level zero consistency constraints, such as keeping track of physical
locations, should not be a concern of a level one designer.
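
The Disk-1 example can be sketched in Python as follows. The class `Disk` and its method names are assumptions chosen to mirror Allocate-Blocks and Deallocate-Blocks, and the block map is reduced to a simple list. Concurrency control and recovery are left out, since the sketch only aims to show how many level zero states collapse into one level one state.

```python
from collections import Counter


class Disk:
    """Level zero object; the physical block map is hidden from level one."""

    def __init__(self, total_blocks=100_000):
        self.total_blocks = total_blocks
        self._owner = [None] * total_blocks   # physical block -> user or None

    def allocate_blocks(self, user, count):
        """Primitive step for level one; leaves the state unchanged on failure."""
        free = [b for b, owner in enumerate(self._owner) if owner is None]
        if len(free) < count:
            return False
        for b in free[:count]:
            self._owner[b] = user
        return True

    def deallocate_blocks(self, user, count):
        owned = [b for b, owner in enumerate(self._owner) if owner == user]
        if len(owned) < count:
            return False
        for b in owned[:count]:
            self._owner[b] = None
        return True

    def abstract_state(self):
        """Level one view: (number of empty blocks, blocks used per user)."""
        used = Counter(owner for owner in self._owner if owner is not None)
        empty = self.total_blocks - sum(used.values())
        return empty, dict(used)
```

Which physical blocks a user receives is invisible in `abstract_state`, so every block map with the same per-user counts falls into the same equivalence class. The constraint that empty and allocated blocks sum to the disk size holds after every call.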

Let us develop the example one step further. Suppose that we have ten such Arobjects, Disk-1, ..., Disk-10.
In this case, the abstract state space at level one is the Cartesian product of the abstract state spaces provided
by each of the disks. That is, the state space at level one is all the possible combinations of allocating (10 x
100,000) blocks to users. At level one, Arobject Disk-System can define new consistency constraints upon the
abstract state space provided by level zero. For example, in addition to providing 800,000 blocks of abstract
storage state space to the Arobject File-System at level two, Disk-System can pair Disk-1 and Disk-2 to
provide an abstract state space of 100,000 blocks of "reliable storage" that is able to survive a single disk failure.
Arobject Disk-System uses transactions Allocate-Blocks and Deallocate-Blocks at level one to produce
transactions Allocate/Deallocate-Ordinary-Blocks and Allocate/Deallocate-Reliable-Blocks as primitive steps for
level two. Within the state space provided by Disk-System, the File-System at level two can define its own
consistency constraints such as dividing the space into system space and a set of user spaces at level three.
Within a user's space, one can define his own Arobjects such as queues, stacks and records. Users can continue
the layering of their own Arobjects.
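
The block counts in this example follow from simple arithmetic, which the short Python sketch below spells out. The function name and parameters are hypothetical. It assumes that each reliable block is stored on both disks of a mirrored pair, so a pair contributes one disk's worth of reliable storage.

```python
def disk_system_capacity(disks, blocks_per_disk, mirrored_pairs):
    """Split raw capacity into ordinary and reliable (mirrored) blocks."""
    raw = disks * blocks_per_disk
    reliable = mirrored_pairs * blocks_per_disk   # one copy on each disk of a pair
    ordinary = raw - 2 * reliable                 # disks not used for mirroring
    return {"raw": raw, "ordinary": ordinary, "reliable": reliable}


# Ten disks of 100,000 blocks with Disk-1 and Disk-2 paired.
capacity = disk_system_capacity(disks=10, blocks_per_disk=100_000, mirrored_pairs=1)
```

For the configuration in the text this gives 1,000,000 raw blocks, of which 800,000 are ordinary and 100,000 are reliable.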

Despite the conceptual development of Arobject, there are still a number of issues that need to be addressed
in both the development and formalization of this new model of computation. Currently, Tokuda, Clark and
Locke are developing and implementing their Arobjects and the ArchOS decentralized operating system, and
we are in the process of developing a formal model of the concurrency control and failure recovery rules in
the context of distributed abstract data types such as Arobjects.




References

[Allchin 82]
Allchin, James R. and Martin S. McKendry.
Object Based Synchronization and Recovery.
Technical Report GIT-ICS-82/15, School of Information and Computer Science, Georgia
Institute of Technology, 1982.

[Beeri 83]
Beeri, C., P. A. Bernstein, N. Goodman, and M. Y. Lai.
A Concurrency Control Theory for Nested Transactions.
ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, 1983.

[Bernstein 79]
Bernstein, Phillip A., David W. Shipman, and Wing S. Wong.
Formal Aspects of Serializability in Database Concurrency Control.
IEEE Transactions on Software Engineering: pages 203-216, 1979.

[Bernstein 83]
Bernstein, P. A., Goodman, N. and Hadzilacos, V.
Recovery Algorithms for Database Systems.
Technical Report, Aiken Computation Laboratory, Harvard University, March 1983.

[Eswaran 76]
Eswaran, K. P., J. N. Gray, R. A. Lorie and I. L. Traiger.
The Notion of Consistency and Predicate Lock in a Database System.
CACM, 1976.

[Goree 83]
Goree, J. A. Jr.
Internal Consistency of A Distributed Transaction System with Orphan Detection.
Master's thesis, Laboratory for Computer Science, M.I.T., 1983.

[Jensen 83]
Jensen, E. Douglas.
The Archons Project: An Overview.
Proc. International Symposium on Synchronization, Control, and Communication in
Distributed Systems, Academic Press, 1983.

[Jensen 84]
Jensen, E. D., Pleszkoch, N.
ArchOS: A Physically Dispersed Operating System - An Overview of Its Objectives and
Approach.
IEEE Distributed Processing Technical Committee Newsletter, Special Issue on Distributed
Operating Systems, June, 1984.

[Korth 83]
Korth, H. F.
Locking Primitives in a Database System.
JACM, Vol. 30, No. 1, January, 1983.

[Kung 79]
Kung, H. T. and C. H. Papadimitriou.
An Optimal Theory of Concurrency Control for Databases.
In Proceedings of the SIGMOD International Conference on Management of Data, pages
116-126. ACM, 1979.




[Lynch 83a]
Lynch, N. A.
Concurrency Control for Resilient Nested Transactions.
Proc. 2nd SIGACT-SIGMOD Conf. on Principles of Database Systems, 1983.

[Lynch 83b]
Lynch, N. A.
Multi-level Atomicity - A New Correctness Criterion for Database Concurrency Control.
ACM Transactions on Database Systems, Vol. 8, No. 4, December, 1983.

[Molina 83]
Garcia-Molina, H.
Using Semantic Knowledge For Transaction Processing In A Distributed Database.
ACM Transactions on Database Systems, Vol. 8, No. 2, June, 1983.

[Moss 81]
Moss, E.
Nested Transactions: An Approach to Reliable Distributed Computing.
Technical Report, M.I.T. Laboratory for Computer Science TR 260, Apr. 1981.

[Schwarz 82]
Schwarz, Peter M. and Alfred Z. Spector.
Synchronizing Shared Abstract Types.
Technical Report CMU-CS-82-128, Department of Computer Science, Carnegie-Mellon
University, 1982.

[Sha 83]
Sha, Lui, E. Douglas Jensen, Richard F. Rashid, and J. Duane Northcutt.
Distributed Co-operating Processes and Transactions.
Proceedings of ACM SIGCOMM symposium, 1983.

[Silberschatz 80]
Silberschatz, A., and Z. Kedem.
Consistency in Hierarchical Database Systems.
JACM 27:1, 1980.

[Tokuda 85]
Tokuda, H., Clark, R. K. and Locke, C. D.
Archons Operating System (ArchOS) - Client Interface, Part II.
Technical Report, Department of Computer Science, Carnegie-Mellon University, 1985.

[Weihl 84]
Weihl, W. E.
Specification and Implementation of Atomic Data Types.
PhD thesis, Massachusetts Institute of Technology, 1984.

[Wood 80]
Wood, W. G.
Recovery Control of Communication Processes in a Distributed System.
Technical Report 158, University of Newcastle upon Tyne, 1980.


MISSION
of
Rome Air Development Center


RADC plans and executes research, development, test and selected
acquisition programs in support of Command, Control, Communications
and Intelligence (C3I) activities. Technical and engineering support within
areas of competence is provided to ESD Program Offices (POs) and other
ESD elements to perform effective acquisition of C3I systems. The areas
of technical competence include communications, command and control,
battle management, information processing, surveillance sensors,
intelligence data collection and handling, solid state sciences,
electromagnetics, and propagation, and electronic, maintainability, and
compatibility.


